Accepted Manuscript
Confidence distributions for change-points and regime shifts
Céline Cunen, Gudmund Hermansen, Nils Lid Hjort
PII: S0378-3758(17)30166-0
DOI: https://doi.org/10.1016/j.jspi.2017.09.009
Reference: JSPI 5595
To appear in: Journal of Statistical Planning and Inference
Please cite this article as: Cunen C., Hermansen G., Hjort N.L., Confidence distributions for
change-points and regime shifts. J. Statist. Plann. Inference (2017),
https://doi.org/10.1016/j.jspi.2017.09.009
CONFIDENCE DISTRIBUTIONS FOR CHANGE-POINTS
AND REGIME SHIFTS
Céline Cunen, Gudmund Hermansen and Nils Lid Hjort
Department of Mathematics, University of Oslo
Abstract. Suppose observations y_1, …, y_n stem from a parametric model f(y, θ),
with the parameter taking one value θ_L for y_1, …, y_τ and another value θ_R for
y_{τ+1}, …, y_n. This article provides and examines two different general strategies
for not merely estimating the break point τ but also for complementing such an
estimate with full confidence distributions, both for the change-point τ and for
associated measures of differences between the two levels of θ. The first strategy
involves testing homogeneity for the two segments to the left and the right of a
candidate change-point value at a fine-tuned level of significance. Carrying out
such a scheme requires having a goodness-of-fit test for constancy of the θ parameter
over a segment of indices, and we also develop classes of such tests. These also
have some independent interest. The second general method uses the log-likelihood
function, profiled over the other parameters, and we show how this may lead to
confidence inference for τ. Our methods are illustrated for four real data stories,
with these meeting different types of challenges.
Key words: change-points, confidence distributions, homogeneity testing, log-likelihood
profiling, monitoring bridges, regime shifts, Tirant lo Blanch
1. Introduction and summary
Many types of processes and natural phenomena experience change-points, some-
times via a jump in mean level and on other occasions via different and perhaps more
subtle changes of behaviour. Such changes and discontinuities, when parameters of
a model change from one state to another, are variously called break-points, tipping
points, paradigm or regime shifts, structural changes or critical transitions, depend-
ing on the type or school of application. There is naturally a vast literature inside
several areas of application, from engineering (see e.g. Frick et al. (2014)), economics
and finance, to biology (e.g. Gould & Eldredge (1977)), meteorology, geology,
climate, sociology and history (cf. Spengler (1918), Fukuyama (1992)). As Gladwell
(2000) writes in The Tipping Point, “the tipping point is that magic moment
when an idea, trend, or social behavior crosses a threshold, tips, and spreads like
wildfire”. There is similarly a large literature regarding aspects of estimation and
assessment of change-points inside statistical methodology. The present paper is
Date: August 2017.
a contribution to the methodological side but also presents real data applications
stories. Our methods aim at spotting change-points, but, importantly, along with a
full assessment of uncertainty, in the form of confidence distributions.
A fruitful statistical framework is as follows. Suppose y_1, …, y_n are independent
from a model with density say f(y, θ), with θ of dimension say p. Our theme is that
of pinpointing and providing full inference for the break point τ, assumed to exist,
where the θ associated with y_1, …, y_τ is equal to one value, say θ_L, whereas the
parameter vector behind y_{τ+1}, …, y_n, say θ_R, is different. For various applications
it may be necessary to extend this framework to models with dependence, as for
time series, and several of our methods work also for such cases. The statistical
challenge is to estimate τ, along with measures of uncertainty. The traditional ways
of reporting precision of parameter estimates are via standard errors (estimates
of standard deviation) or say 95% confidence intervals. Our preferred format is
that of a full confidence curve, say cc(τ, y_obs), based on the observed dataset y_obs.
Its interpretation is that, at the true change-point parameter τ, the set
R(α) = {τ : cc(τ, Y) ≤ α} ought to have probability approximately equal to α, with Y
denoting a random dataset drawn from the model; see Schweder & Hjort (2016)
for a full account of confidence distributions. In particular, confidence sets at any
confidence level can be read off from the confidence curve.
The theory and applications of confidence distributions work out more easily for
continuous parameters in smooth models, for several reasons. First, for a continuous
parameter there is then a possibility of having exact or nearly exact confidence dis-
tributions, in the sense that R(α) given above has probability equal to or very close
to α, for each confidence level α. This is not fully attainable for the present case
of change-point parameters, as the natural statistics informative for τ, like a point
estimator τ̂, have discrete distributions. Secondly, various methods and results
pertaining to continuous parameters of smooth models, related to exact or approximate
distributions for such statistics, like large-sample normality or chi-squaredness of
deviances, are not valid and have no clear parallels when it comes to inference for τ .
Confidence distributions and confidence curves may nevertheless be fruitfully con-
structed for various situations with discrete parameters, as developed in Schweder
& Hjort (2016, Ch. 3). This is also the line of development and investigation for the
present paper.
In Sections 2 and 3 we propose two different general methods for obtaining such
confidence curves for change-points. The first of these requires having a homogeneity
test for each given segment of data points where the hypothesis of no change can be
accurately examined. For this reason we develop classes of general goodness-of-fit
tests for such homogeneity hypotheses in Section 4. Tests we develop there, based on
successive log-likelihood maxima, ought also to have independent interest. Questions
regarding behaviour and performance of our different confidence distributions are
then treated in Section 5.
The methods we develop in this paper are then shown at work for four different
stories with real data, each involving separate challenges. In Section 6 a Poisson
model is used to assess British mining disasters 1851–1962, with confidence inference
for both the change-point and the relative change. In Section 7 we use different
versions of our methodology to pinpoint precisely where the second author (Martí
Joan de Galba) took over for the first author (Joanot Martorell), in what is arguably
the world's first proper novel, Tirant lo Blanch, published in València 1490. Several
scholars have previously worked with multinomial models for word sizes, but
we demonstrate that the data are overdispersed, inviting the use of multinomial-
Dirichlet type models. Then in Section 8 we consider a time series of the number of
skiing days at a certain place near Oslo, from 1897 to 2014, where the question
is precisely when Nature started changing her ways. Finally in Section 9 we examine
an important and long-running time series from fisheries sciences, consisting of the
liver quality of skrei (the North-East Atlantic cod, Gadus morhua), from 1859 to
2013, along with potentially influencing covariates. The aim is again to pinpoint
where a moderately complex model for such data experiences a regime shift. We
end our article by offering a list of concluding remarks in Section 10, some pointing
to further relevant research for change-point inference.
For further pointers to the statistics literature, regarding both methods and ap-
plications, see e.g. Frigessi & Hjort (2002) for a general discussion of discontinuities
in statistical models, in their introduction to a special issue of Journal of Nonpara-
metric Statistics on such topics, and the edited volume Carlstein et al. (1994). Frick
et al. (2014) develop methods for joint inference of multiple jumps in a certain class
of models, and references in that paper give pointers to several other approaches
to change-point analyses. For Bayesian approaches, consult Carlin et al. (1992),
Fearnhead (2006), and also Section 5.3 below.
2. General method A: via tests of homogeneity
Though we shall work with models with dependence later in our paper, we
assume in the present section that the Y_i are independent, with density f(y, θ_i) for
observation i. Assume further that we for each given n have managed to construct a
well-working goodness-of-fit test for the homogeneity hypothesis H_{1,n}: θ_1 = ··· = θ_n,
say Z_{1,n}, with null distribution G_{1,n}. Testing H_{1,n} at level 0.05, for example, is then
carried out by rejecting if Z_{1,n} > G_{1,n}^{-1}(0.95), etc. We shall come back to classes of
such tests in Section 4.
Consider now the regime shift setup, where the θ_i are equal to a θ_L for
i = 1, …, τ but equal to a different θ_R for i = τ + 1, …, n. To form a confidence set for
τ, at confidence level α, we suggest forming

  R(α) = {τ : H_{1,τ} is accepted at level √α, H_{τ+1,n} is accepted at level √α}
       = {τ : Z_{1,τ} ≤ G_{1,τ}^{-1}(√α), Z_{τ+1,n} ≤ G_{τ+1,n}^{-1}(√α)}   (2.1)

for each of a grid of α values. The probability that τ belongs to this random set,
under the true τ, is then

  P_τ{τ ∈ R(α)} = √α · √α = α.

Note that R(α) consists of points, seen as candidate values for τ at confidence level
α, not an interval per se; also, it may not be connected, as seen in e.g. Figure 9.2.
For an easy illustration, suppose Y_i ∼ N(θ_i, 1). Here there is a simple test for
homogeneity for any given segment of observations using
Q_{a+1,a+b} = Σ_{i=a+1}^{a+b} (Y_i − Ȳ)², with Ȳ = Ȳ_{a+1,a+b} the average, and which has
a simple χ²_{b−1} null distribution (other tests will be considered in Section 5). Thus
we may easily find the set

  R(α) = {τ : Q_{1,τ} ≤ H_{τ−1}^{-1}(√α), Q_{τ+1,n} ≤ H_{n−τ−1}^{-1}(√α)}   (2.2)

for each confidence level α, writing H_ν(·) for the distribution function of a χ²_ν. The
R(α) sets can then be displayed for α values 0.01, 0.02, 0.04, …, 0.96, 0.98, 0.99, say.
Simple simulations reveal that the R(α) sets for given confidence level α might not be
connected, i.e. may consist of a union of different connected sets, and also that they
may be empty for smaller levels. The sets R(α) can be used to define a confidence
curve for our method A,

  cc_A(τ, y) = min{α : τ ∈ R(α)}.   (2.3)

This curve fulfils the important property that it will be uniformly distributed at the
true change-point value τ_0. This is easily established by realising that cc_A(τ_0, Y) ≤ α
is equivalent to τ_0 ∈ R(α). For some further properties of this confidence curve, see
Section 5.
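To make the construction concrete, here is a small numerical sketch of the R(α) sets (2.2) and the confidence curve (2.3) for the unit-variance normal model. The code, function names and simulated data are ours, added for illustration only:

```python
import numpy as np
from scipy.stats import chi2

def cc_A(y, alphas=np.linspace(0.01, 0.99, 99)):
    """Method A confidence curve (2.3) for one change-point in N(theta_i, 1)
    data, via the chi-squared homogeneity tests of (2.2)."""
    n, cc = len(y), {}
    for tau in range(2, n - 1):          # at least two observations per segment
        qL = ((y[:tau] - y[:tau].mean()) ** 2).sum()
        qR = ((y[tau:] - y[tau:].mean()) ** 2).sum()
        # smallest alpha on the grid whose R(alpha) contains this tau
        inside = [a for a in alphas
                  if qL <= chi2.ppf(np.sqrt(a), tau - 1)
                  and qR <= chi2.ppf(np.sqrt(a), n - tau - 1)]
        cc[tau] = min(inside) if inside else 1.0   # 1.0: in no R(alpha) on the grid
    return cc

rng = np.random.default_rng(1)
y = np.concatenate([rng.normal(0, 1, 40), rng.normal(8, 1, 40)])  # jump at tau = 40
curve = cc_A(y)
tau_hat = min(curve, key=curve.get)      # the tau surviving at the lowest level
```

With a jump this pronounced, only the true split survives at any of the grid levels, which is exactly the "very clear change-point" situation discussed in Section 5.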
The √α · √α = α idea works of course also with other combinations, like using
α^{τ/n} α^{1−τ/n} for the τ under scrutiny, which we find tends to work slightly better
in terms of leading to somewhat slimmer confidence sets; see Section 5. For the
illustration just considered, cf. (2.2), there are other homogeneity tests that may be
used, in addition to the simple chi-squared method used there, and some alternatives
are worked with in Section 5.
In more complex models the situation is less clear-cut than for the illustration
around eq. (2.2), not due to any conceptual difficulties with method (2.1), but
because we may not have a test of homogeneity with an exact null distribution fully
free of parameters. As long as there is a decent test Z_{1,n}, for each stretch 1 to n, with
a null distribution exactly or approximately independent of any underlying parameters,
we are very much in business, however. We discuss two classes of such tests in
Section 4. It is also important to realise that the (2.1) method works in complicated
and perhaps high-dimensional situations, as long as there is such a homogeneity test.
A case in point is the nonparametric graph-based scan statistics method of Chen
& Zhang (2015), which may be put to work as long as a similarity measure on the
sample space can be given. In fact Chen & Zhang (2015) utilise an idea similar to
our (2.1) above, but in the context of constructing a single confidence interval inside
a special model framework only; our concern is that of a full confidence curve, and
we emphasise the broad generality of the approach. One may also find traces of
related ideas, such as for blocking parameters into groups and identifying splits, in
Cox & Spjøtvoll (1982) and Worsley (1986). We take time to mention that a Bonferroni
version of the argument may be used in cases where data from the left and right
segments are dependent, with ½ + ½α replacing √α in (2.1), yielding an alternative
set R_b(α); this secures a conservative P_τ{τ ∈ R_b(α)} ≥ α. The difference is actually
slight for confidence levels α > ½ and very small for the higher levels.
3. General method B: profiled log-likelihood and deviance
Suppose in general terms that Y_1, …, Y_τ come from f(y, θ_L) and Y_{τ+1}, …, Y_n
stem from f(y, θ_R). This corresponds to a log-likelihood function of the form

  ℓ(τ, θ_L, θ_R) = Σ_{i≤τ} log f(y_i, θ_L) + Σ_{i≥τ+1} log f(y_i, θ_R) = ℓ_{1,τ}(θ_L) + ℓ_{τ+1,n}(θ_R).

We shall see how profiled versions may lead to confidence distributions, for both the
breakpoint position τ and for the degree of change, suitably measured.
3.1. Confidence for the breakpoint. From the function above we may compute
the profile log-likelihood function

  ℓ_prof(τ) = max_{θ_L, θ_R} ℓ(τ, θ_L, θ_R) = ℓ(τ, θ̂_L(τ), θ̂_R(τ))
            = ℓ_{1,τ}(θ̂_L(τ)) + ℓ_{τ+1,n}(θ̂_R(τ)),   (3.1)

involving the maximisers of ℓ(τ, θ_L, θ_R) over θ_L and θ_R for given τ. The maximiser of
ℓ_prof is the maximum likelihood (ML) estimator τ̂, yielding also the ML estimators
θ̂_L = θ̂_L(τ̂) to the left and θ̂_R = θ̂_R(τ̂) to the right. From the profile we form and display
the deviance function

  D(τ, Y) = 2{ℓ_prof(τ̂) − ℓ_prof(τ)}.   (3.2)
To construct a confidence curve for τ based on the deviance, consider the estimated
distribution of D(τ, Y) at position τ,

  K_τ(x) = P_{τ, θ̂_L, θ̂_R}{D(τ, Y) < x}.

The Wilks theorem says that K_τ(x) is approximately the distribution function of a
χ²_1, in the case of parametric models smooth in their continuous parameters. There is
no Wilks theorem in the present case of a discrete-valued parameter τ, however, so
we typically need to resort to computing K_τ(x) by simulation. Also, D(τ, Y) has
a discrete distribution, say with positive point probabilities k_τ(x) for certain x; in
particular, there is a positive probability k_τ(0) = P_{τ, θ̂_L, θ̂_R}{τ̂ = τ} that the deviance is
zero. Hence the probability transform K_τ(D(τ, Y)) does not have an exact uniform
distribution.
We shall nevertheless work with the construction

  cc(τ, y_obs) = K_τ(D(τ, y_obs)) = P_{τ, θ̂_L, θ̂_R}{D(τ, Y) < D(τ, y_obs)}.   (3.3)

The probability that cc(τ, Y) ≤ α, under the true change-point parameter τ, is often
well approximated with α, allowing the interpretation that confidence sets for τ can
be read off from a plot of cc(τ, y), which we call a confidence curve. The cc(τ, y) of
(3.3) is the acceptance probability for τ, or one minus the p-value for testing that
value of τ, using the deviance based test which rejects for high values of D(τ, Y).
We compute K_τ and hence cc(τ, y) by simulation, i.e.

  cc(τ, y_obs) = B^{-1} Σ_{j=1}^B I{D(τ, Y*_j) < D(τ, y_obs)},

for a large enough number B of simulated copies Y* of the dataset. This needs to be
carried out for each candidate value τ, with generated data Y_i from f(y, θ̂_L) to the
left of τ and from f(y, θ̂_R) to the right of τ. For a related idea see Section 10.1.
3.2. The normal case. An important special case of our general problem formulation
is that of the normal with constant variance. Consider first the case where
this variance is known, for convenience now taken to be one. With levels ξ_L and ξ_R
to the left and to the right, the log-likelihood is

  ℓ(τ, ξ_L, ξ_R) = −½ Σ_{i≤τ} (y_i − ξ_L)² − ½ Σ_{i≥τ+1} (y_i − ξ_R)²,

leading to ℓ_prof(τ) = −½{Q_L(τ) + Q_R(τ)}, with Q_L(τ) = Σ_{i≤τ} {y_i − ȳ_L(τ)}² and
Q_R(τ) = Σ_{i≥τ+1} {y_i − ȳ_R(τ)}², writing ȳ_L(τ) and ȳ_R(τ) for the averages to the left
and the right of τ. The ML for τ is the value minimising the sum of these empirical
variances to the left and the right, or, equivalently, maximising τ ȳ_L(τ)² + (n − τ) ȳ_R(τ)².
Confidence statements for τ can then be reached via the recipe above,
based on the deviance

  D(τ, y) = Q_L(τ) + Q_R(τ) − Q_L(τ̂) − Q_R(τ̂).
Next assume that the model takes N(ξ_L, σ²_L) to the left and N(ξ_R, σ²_R) to the right,
with the four parameters being unknown, in addition to the breakpoint. Maximising
the log-likelihood

  ℓ = −τ log σ_L − ½(1/σ²_L)[Q_L(τ) + τ{ȳ_L(τ) − ξ_L}²]
      − (n − τ) log σ_R − ½(1/σ²_R)[Q_R(τ) + (n − τ){ȳ_R(τ) − ξ_R}²]

over first ξ_L, ξ_R and then σ_L, σ_R yields the profile log-likelihood function

  ℓ_prof(τ) = −τ log σ̂_L(τ) − (n − τ) log σ̂_R(τ),

where σ̂_L(τ)² = (1/τ) Σ_{i≤τ} {y_i − ȳ_L(τ)}² and similarly with σ̂_R(τ)². We see that the
ML estimate of τ is the value that minimises σ̂_L(τ)^{τ/n} σ̂_R(τ)^{1−τ/n}. Also,

  D(τ, y) = 2{τ log σ̂_L(τ) + (n − τ) log σ̂_R(τ) − τ̂ log σ̂_L(τ̂) − (n − τ̂) log σ̂_R(τ̂)},

and a confidence curve can be based on this, as per (3.3).
In this brief section on inference for the breakpoint in the normal model we finally
include the important case where data follow N(ξ_L, σ²) to the left and N(ξ_R, σ²) to
the right, i.e. with a common σ. This requires a modest extension of (3.1)–(3.2) to
the case of common parameters being present on both sides of the breakpoint. The
point is that recipe (3.3) for the confidence curve is still operable and valid. The
log-likelihood function for this four-parameter model becomes

  −n log σ − ½(1/σ²)[Q_L(τ) + τ{ȳ_L(τ) − ξ_L}² + Q_R(τ) + (n − τ){ȳ_R(τ) − ξ_R}²],

which is easily maximised over (ξ_L, ξ_R, σ) for each fixed τ. We find

  σ̂(τ)² = n^{-1}{Q_L(τ) + Q_R(τ)},

and ℓ_prof(τ) = −n log σ̂(τ). The ML for τ is the τ̂ making σ̂(τ) smallest. Also,

  D(τ, y) = n log{σ̂²(τ)/σ̂²(τ̂)}.
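A minimal sketch of this common-σ recipe, in our own code (the simulated data and function name are illustrative):

```python
import numpy as np

def deviance_common_sigma(y):
    """D(tau, y) = n log{sigma2(tau) / sigma2(tau_hat)} for the model with
    N(xi_L, sigma^2) left of tau and N(xi_R, sigma^2) right of tau."""
    n = len(y)
    taus = np.arange(2, n - 1)           # at least two observations per segment
    s2 = np.array([(((y[:t] - y[:t].mean()) ** 2).sum()
                    + ((y[t:] - y[t:].mean()) ** 2).sum()) / n for t in taus])
    tau_hat = taus[np.argmin(s2)]        # ML: the tau making sigma-hat smallest
    return taus, n * np.log(s2 / s2.min()), tau_hat

rng = np.random.default_rng(4)
y = np.concatenate([rng.normal(0, 2, 50), rng.normal(10, 2, 50)])
taus, D, tau_hat = deviance_common_sigma(y)
```

The resulting deviance curve can then be fed into the simulation recipe (3.3) exactly as before.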
3.3. The multinormal case. Assume the observations y_i are multivariate and
normally distributed. Here we derive the required formulae for log-likelihood maxima
and deviance functions, under two scenarios, corresponding to having the variance
matrix constant or not, for the left and the right part of the data.

When y_1, …, y_n are i.i.d. N_p(ξ, Σ), the log-likelihood function is

  ℓ_n = −½ n log|Σ| − ½ Σ_{i=1}^n (y_i − ξ)^t Σ^{-1} (y_i − ξ) − ½ np log(2π).

This is maximised by ξ̂ = ȳ and Σ̂ = n^{-1} Σ_{i=1}^n (y_i − ξ̂)(y_i − ξ̂)^t, see e.g. Mardia et al.
(1979), with ensuing maximum ℓ_{n,max} = −½ n log|Σ̂| − ½ np{1 + log(2π)}. This leads
to a clear formula for the profile log-likelihood function for the model which takes
N_p(ξ_L, Σ_L) to the left and N_p(ξ_R, Σ_R) to the right; indeed,

  ℓ_prof(τ) = −½ τ log|Σ̂_L(τ)| − ½ (n − τ) log|Σ̂_R(τ)|   (3.4)

plus irrelevant constants. Here Σ̂_L(τ) = (1/τ) Σ_{i≤τ} (y_i − ȳ_L)(y_i − ȳ_L)^t, with ȳ_L the
average to the left, and similarly for Σ̂_R(τ).
Analogous calculations for the case of a common Σ across the range of data, but
with different mean levels ξ_L and ξ_R, lead to

  ℓ_prof(τ) = −½ n log|Σ̂(τ)|, with Σ̂(τ) = (τ/n) Σ̂_L(τ) + (1 − τ/n) Σ̂_R(τ).   (3.5)

In particular, the ML estimator τ̂ for this model is the value of τ minimising
log|Σ̂(τ)|. This also yields the deviance formula

  D(τ, y) = n{log|Σ̂(τ)| − log|Σ̂(τ̂)|},

with τ̂ the ML estimator.
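The common-Σ case (3.5) is easy to put on a computer; the following is our own sketch, with illustrative function names and simulated data:

```python
import numpy as np

def deviance_multinormal(y):
    """D(tau, y) = n{log|Sigma(tau)| - log|Sigma(tau_hat)|}, with Sigma(tau)
    the pooled estimate (tau/n) Sigma_L(tau) + (1 - tau/n) Sigma_R(tau) of (3.5);
    y has shape (n, p)."""
    n, p = y.shape
    taus = np.arange(p + 1, n - p)       # keep both covariance estimates nonsingular
    logdets = []
    for t in taus:
        SL = np.cov(y[:t].T, bias=True)  # (1/t) sum (y_i - ybar_L)(y_i - ybar_L)^t
        SR = np.cov(y[t:].T, bias=True)
        S = (t / n) * SL + (1 - t / n) * SR
        logdets.append(np.linalg.slogdet(S)[1])
    logdets = np.array(logdets)
    return taus, n * (logdets - logdets.min()), taus[np.argmin(logdets)]

rng = np.random.default_rng(5)
y = np.vstack([rng.normal(0, 1, (40, 2)), rng.normal(5, 1, (40, 2))])
taus, D, tau_hat = deviance_multinormal(y)
```

Here `np.linalg.slogdet` is used rather than a raw determinant, for numerical stability of the log-determinants.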
3.4. Confidence for the degree of change. In addition to spotting the real
breakpoint τ_true itself there is often interest in the degree of change taking place,
say via a suitable one-dimensional distance measure δ = δ(θ_L, θ_R). The natural
estimator is

  δ̂ = δ(θ̂_L(τ̂), θ̂_R(τ̂)),   (3.6)

featuring the ML estimators of the left and right parameters, calculated at the ML
position τ̂. The distribution of δ(θ̂_L(τ), θ̂_R(τ)), for a given τ, is typically close to a
normal, but with a statistical bias of size O(|τ − τ_true|/τ) + O(|τ − τ_true|/(n − τ)),
depending on how close τ is to the real value. The distribution of δ̂ of (3.6) is
a complex mixture of many such approximate normals, and with variable biases,
depending also on how precise τ̂ is for estimating τ_true.
In our investigations we have found it a sounder general approach to go for
the profiled log-likelihood and deviance function, as for method B above, but now
profiling for δ. The recipe is hence to compute

  ℓ_prof(δ) = max{ℓ(τ, θ_L, θ_R): δ(θ_L, θ_R) = δ},

then the deviance D(δ, y_obs) = 2{ℓ_prof(δ̂) − ℓ_prof(δ)}, followed by

  cc(δ, y_obs) = P_δ{D(δ, Y) ≤ D(δ, y_obs)} = L_δ(D(δ, y_obs)).   (3.7)

There are two options here, for defining and then computing the probability distribution
L_δ of D(δ, Y); these are related and often lead to very nearly the same results.
The first is to fix τ at the ML position τ̂ and compute L_δ under (τ̂, θ̂_L(δ), θ̂_R(δ)),
the position in the parameter space maximising ℓ(τ, θ_L, θ_R) under the profiling
constraint δ(θ_L, θ_R) = δ. The second is to follow the full profiling also for the τ part,
i.e. finding for given δ the point (τ̂(δ), θ̂_L(δ), θ̂_R(δ)) at which the log-likelihood is
maximised, again under δ(θ_L, θ_R) = δ but without fixing τ at the ML position. In
both cases one computes L_δ(·) and hence cc(δ, y) of (3.7) by simulating datasets y_L
and y_R to the left and the right from the estimated models and then computing the
log-likelihood functions and hence D(δ, Y). This approach to reaching confidence
inference for a degree of change parameter is illustrated for the ratio of Poisson rate
parameters in Section 6 and for a ratio of standard deviations inside a broader model
in Section 8.
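To illustrate the first of these two options in the simplest setting, here is our own sketch for δ = ξ_R − ξ_L in the known-variance normal model of Section 3.2, with τ held fixed at τ̂; in this model the constrained maximiser and the deviance have closed forms. All names, the grid and the simulated data are ours:

```python
import numpy as np

def dev_delta(y, tau, grid):
    """Deviance for delta = xi_R - xi_L, with tau fixed: with known unit
    variance, D(delta) = {tau(n - tau)/n} (delta - delta_hat)^2."""
    n = len(y)
    d_hat = y[tau:].mean() - y[:tau].mean()
    return (tau * (n - tau) / n) * (grid - d_hat) ** 2, d_hat

def cc_delta(y, tau, grid, B=400, seed=6):
    """cc(delta) of (3.7), with L_delta simulated under the constrained fit
    (xi_L-hat(delta), xi_L-hat(delta) + delta), keeping tau at its ML value."""
    rng = np.random.default_rng(seed)
    n = len(y)
    D_obs, d_hat = dev_delta(y, tau, grid)
    cc = np.empty(len(grid))
    for k, d in enumerate(grid):
        xiL = (y.sum() - (n - tau) * d) / n   # constrained ML of xi_L given delta = d
        Dstar = np.empty(B)
        for b in range(B):
            ystar = np.concatenate([rng.normal(xiL, 1, tau),
                                    rng.normal(xiL + d, 1, n - tau)])
            Dstar[b] = dev_delta(ystar, tau, np.array([d]))[0][0]
        cc[k] = np.mean(Dstar <= D_obs[k])
    return cc, d_hat

rng = np.random.default_rng(7)
y = np.concatenate([rng.normal(0, 1, 30), rng.normal(2, 1, 30)])
grid = np.linspace(0.5, 3.5, 31)
cc, d_hat = cc_delta(y, 30, grid)
```

The curve dips near δ̂ and climbs towards one away from it, so confidence intervals for δ can be read off at any level.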
4. Monitoring bridges and homogeneity tests
To apply the general method proposed in Section 2, particularly when encounter-
ing models outside the slim standard list where explicit tests might be available, one
needs methods for testing distributional homogeneity of a sequence of observations,
i.e. that θ
a+1
, . . . , θ
a+b
associated with observations a + 1, . . . , a + b have remained
unchanged. Here we describe some tests of this type. The monitoring bridges we con-
struct based on log-likelihood maxima, along with associated goodness-of-fit tests,
appear to be new and ought to have independent interest.
4.1. Monitoring bridges. Consider the sequence y_1, …, y_n, with y_i ∼ f(y, θ_i).
Classes of such tests for constancy of the θ_i have been worked with in Hjort &
Koning (2002). In particular, one may use the monitoring process

  M_n(t) = n^{-1/2} Ĵ^{-1/2} Σ_{i≤[nt]} u(Y_i, θ̂)  for 0 ≤ t ≤ 1,   (4.1)

in terms of the score function u(y, θ) = ∂ log f(y, θ)/∂θ and the maximum likelihood
estimator θ̂ assuming θ_i = θ. Also, Ĵ is the p×p estimated Fisher information matrix,
with p the dimension of the model. We note that M_n(·) consists of p components,
each starting and ending in zero; also, M_n is constant on each cell [k/n, (k + 1)/n),
with M_n(k/n) = n^{-1/2} Ĵ^{-1/2} Σ_{i≤k} u(Y_i, θ̂). Hjort and Koning prove that M_n →_d M,
the limit having p independent components W^0_1, …, W^0_p, each a Brownian bridge.
Hence plotting M_{n,j} and checking various behavioural aspects, like their maxima or
minima, leads to clear tests for homogeneity. These may be used in connection with
the general construction of Section 2.
One particular version of this strategy is to test homogeneity using

  Z_n = max_{j≤p} ‖M_{n,j}‖ = max_{j≤p} max_{k≤n} |M_{n,j}(k/n)|.

Under homogeneity, Z_n →_d Z = max_{j≤p} max_{0≤t≤1} |W^0_j(t)|. The distribution for a
single one of these maxima of a Brownian bridge can be expressed as

  H(z) = P{max_{0≤t≤1} |W^0(t)| ≤ z} = 1 + 2 Σ_{k=1}^∞ (−1)^k exp(−2k²z²),   (4.2)

as proved in Billingsley (1968, Ch. 3). For the case of the maximum over several
asymptotically independent components, as with the construction (4.1), we have
P{Z_n ≤ z} → H(z)^p, which is easily computed. Other variations can of course be
used here, like the sum of maxima rather than the maximum of maxima, or the
sum of Cramér–von Mises type statistics n^{-1} Σ_{k=1}^n {M_{n,1}(k/n)² + ··· + M_{n,p}(k/n)²}.
The latter tends in distribution to Σ_{j=1}^p ∫_0^1 W^0_j(t)² dt, which can be computed and
tabled via simulations, or via results obtained in Csörgő & Faraway (1996).
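The alternating series in (4.2) converges very quickly and is straightforward to evaluate; a small sketch in our own code:

```python
import numpy as np

def H(z, terms=50):
    """Distribution function (4.2) of max_t |W0(t)| for a Brownian bridge W0,
    evaluated by truncating the alternating series."""
    k = np.arange(1, terms + 1)
    return 1 + 2 * np.sum((-1.0) ** k * np.exp(-2 * k ** 2 * z ** 2))

def max_of_maxima_pvalue(z_obs, p):
    """Approximate p-value of Z_n = max over p components, via P{Z_n <= z} ~ H(z)^p."""
    return 1 - H(z_obs) ** p
```

For instance, H(1.358) is close to 0.95, the familiar 5% point of the Kolmogorov-type distribution.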
4.2. New monitoring bridges for model homogeneity. Here we are however
eager to build a new type of test, using the succession of attained log-likelihood
maxima. Assume homogeneity, i.e. that there is a common θ_0 underlying the
observations. With ℓ_j = ℓ_j(θ) the log-likelihood function based on y_1, …, y_j, we compute
the maximum likelihood estimate θ̂_j and the associated log-likelihood maximum
ℓ̂_j = ℓ_j(θ̂_j). Computing the maximum likelihood estimator takes at least p observations.
Our monitoring bridges take the form

  B̂_{n,j} = n^{-1/2} {ℓ̂_j − (j/n) ℓ̂_n}/κ̂  for j = p, …, n.   (4.3)

Here κ̂ is a consistent estimator of the standard deviation κ of log f(Y, θ_0), e.g.

  κ̂² = (1/n) Σ_{i=1}^n {log f(y_i, θ̂) − ξ̂}²,

where ξ̂ = ℓ̂_n/n is the estimate of ξ = E_{θ_0} log f(Y, θ_0) = ∫ f_{θ_0} log f_{θ_0} dy.
We show below that the process with these B̂_{n,j} values tends to a Brownian
bridge, under the null hypothesis of homogeneity. More precisely, consider the
piecewise constant process B̂_n on [0, 1] with values B̂_{n,j} on [j/n, (j + 1)/n) for j ≥ p, and
zero on [0, p/n). The claim is that under unchanging model conditions,

  B̂_n →_d W^0 in D[0, 1],   (4.4)

the limit being a Brownian bridge (a zero-mean Gaussian process with covariance
function s(1 − t) for s ≤ t). The convergence in distribution in question takes place
in the space of all functions x: [0, 1] → R, right continuous with limits from the
left, equipped with the Skorohod topology; cf. Billingsley (1968). Plotting the B̂_{n,j},
therefore, gives a monitoring bridge which should behave like a Brownian bridge
under homogeneity conditions. The weak convergence result (4.4) implies
h(B̂_n) →_d h(W^0) for all continuous functionals, so that max_{p≤j≤n} |B̂_{n,j}| →_d max_{0≤t≤1} |W^0(t)|,
(n − p + 1)^{-1} Σ_{j=p}^n B̂²_{n,j} →_d ∫_0^1 W^0(t)² dt, etc. Among the benefits of the new
goodness-of-fit construction (4.3) is that a multidimensional parametric family is
mapped directly into a one-dimensional monitoring bridge.
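To show what the construction looks like in practice, here is our own sketch of the bridge (4.3) for the one-parameter N(θ, 1) model (so p = 1), including the ½p(1 − j/n) mean correction of the modified version given at the end of this section; names and data are illustrative:

```python
import numpy as np

LOG2PI = np.log(2 * np.pi)

def monitoring_bridge(y):
    """Log-likelihood-maxima bridge (4.3) for the N(theta, 1) model (p = 1),
    with the 0.5 p (1 - j/n) mean correction included."""
    n, p = len(y), 1
    # successive log-likelihood maxima; theta_hat_j = mean of the first j observations
    lhat = np.array([(-0.5 * LOG2PI - 0.5 * (y[:j] - y[:j].mean()) ** 2).sum()
                     for j in range(p, n + 1)])
    logf = -0.5 * LOG2PI - 0.5 * (y - y.mean()) ** 2   # log f(y_i, theta_hat)
    kappa = logf.std()                                  # estimates sd of log f(Y, theta_0)
    j = np.arange(p, n + 1)
    B = (lhat - (j / n) * lhat[-1] - 0.5 * p * (1 - j / n)) / (kappa * np.sqrt(n))
    return j, B

rng = np.random.default_rng(8)
j, B = monitoring_bridge(rng.normal(0, 1, 200))   # homogeneous data: no change
```

Under homogeneity the plotted values should resemble a Brownian bridge, and max_j |B_{n,j}| can be referred to the distribution (4.2).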
To prove (4.4), start out considering the partial-sum process

  A_n(t) = n^{-1/2} Σ_{i≤[nt]} {log f(y_i, θ_0) − ξ}/κ = n^{-1/2} (ℓ_j − jξ)/κ  for 0 ≤ t ≤ 1,

writing ℓ_j = Σ_{i≤j} log f(y_i, θ_0) and j = [nt] (so that j/n tends to t). From Donsker's
theorem, cf. Billingsley (1968, Ch. 3), A_n →_d A, the Brownian motion process. It
then follows that the process B_n, defined by B_n(t) = A_n(t) − tA_n(1), converges in
distribution to the process B, defined by B(t) = A(t) − tA(1), and this limit is
demonstrably a Brownian bridge process on [0, 1]. Also,

  B_n(j/n) = n^{-1/2} {ℓ_j − (j/n) ℓ_n}/κ,

i.e. the tying-down has caused ξ to drop out.
We may now prove that Â_n and B̂_n have the same limits as A_n and B_n, where

  Â_n(t) = n^{-1/2} (ℓ̂_j − jξ)/κ̂  and  B̂_n(t) = n^{-1/2} {ℓ̂_j − (j/n) ℓ̂_n}/κ̂

for t ∈ [j/n, (j + 1)/n). To show this, note from a Taylor expansion argument that
ℓ_j(θ_0) = ℓ_j(θ̂_j) + ½(θ_0 − θ̂_j)^t ℓ''_j(θ̂_j)(θ_0 − θ̂_j) + o_pr(1), which leads to

  ℓ̂_j = ℓ_j(θ_0) + ½ W_j + o_pr(1),   (4.5)

where W_j = j(θ̂_j − θ_0)^t Ĵ_j(θ̂_j − θ_0) + o_pr(1), with Ĵ_j = −(1/j) ℓ''_j(θ̂_j) being the
normalised observed Fisher information after j data points. The W_j tends to a χ²_p
as j increases. Hence the differences max |Â_n − A_n| and max |B̂_n − B_n| are both
O_pr(p/√n), which goes to zero in probability. This proves claim (4.4).
Note that ℓ̂_j of Â_n overshoots ℓ_j(θ_0) of A_n, essentially with the amount ½ W_j, a
random variable with distribution tending to half a χ²_p, with mean value ½ p. This
suggests using the sample-size modification n^{-1/2} (ℓ̂_j − ½ p − jξ)/κ̂ for Â_n, which with
a bit of algebra leads to the modification

  B_{n,j} = n^{-1/2} {ℓ̂_j − (j/n) ℓ̂_n − ½ p(1 − j/n)}/κ̂

for B̂_{n,j} of (4.3). This version is closer in distribution to that of a Brownian bridge
for finite n. We also point out that the key result (4.4) continues to hold also
in situations with short-range dependence, as for most time series models. This
is essentially since the partial-sum process A_n above still tends to the Brownian
motion, under weak assumptions of this type; see Billingsley (1968, Ch. 4).
5. Performance
In previous sections we have developed a general machinery for confidence in-
ference for change-points. It is clear from these developments that there are several
available methods, for a given dataset and a given vehicle model. In particular, for
general method A there is a choice to be made for the homogeneity test. In the
present section we consider performance issues for the resulting confidence distribu-
tions, also comparing method A with method B. The primary performance aspect is
that the confidence distributions really come close to delivering adequate coverage,
which in our change-point context means that the confidence curve construction
cc(τ, y) should have G(α) = P_τ{cc(τ, Y) ≤ α} close to α. For method B, the
distribution of U_τ = cc(τ, Y) is never perfectly uniform, since it is discrete, though G(α)
is often seen to be close to α with our constructions. For method A, however, the
uniformity at the true change-point value τ is exact (as long as the homogeneity
tests on each side are exact), as demonstrated in Section 2, and will be retained
even if the change-point is very clear and only one τ value, the true one, “survives”
at all levels. In that case the minimum of cc_A(τ) will be uniformly distributed. This
has consequences for the interpretation of the confidence curve defined by method
A: while the τ minimising cc_A(τ, y) may be considered an estimate of the change-point,
the actual minimal value of cc_A(τ, y) is of limited interest, and should not be
interpreted as a measure of certainty of the change-point estimate.

A second performance aspect, which is a measure of certainty of the change-point
estimate, is that a cc(τ, y) should lead to ‘thin’ or narrow confidence sets
{τ : cc(τ, y) ≤ α}, for most or all values of the confidence level α. We measure such
thinness or slimness here by the number of τ belonging to the confidence set where
cc(τ, y) ≤ α, for a range of α levels (rather than the width or range of the set, as the
sets may be non-connected); for simplicity we use the term ‘size’ below to indicate
such numbers.
Schweder & Hjort (2016, Ch. 5) offer a broad discussion of performance and risk
functions for confidence distributions, also identifying classes of situations where
there is a unique optimal confidence procedure; see also the discussion on perfor-
mance in Xie & Singh (2013). Such clear results seem out of reach when it comes
to confidence for change-points, however. Below we report briefly on investigations
into the mentioned performance aspects for our confidence methods.
5.1. Method A with different tests. Method A is a general method for constructing confidence sets for a change-point, but depends on having a well-working test of homogeneity for the segments 1, . . . , τ and τ + 1, . . . , n. It may also depend upon the choice of the test levels at work in (2.1); here we compare having the fixed level, i.e. √α · √α, with the alternative where it depends on the sizes of the segments, via α^{τ/n} · α^{1−τ/n}. Considering the simple model with y_i ~ N(ξ_L, 1) to the left of τ and y_i ~ N(ξ_R, 1) to the right, and with ξ_L and ξ_R unknown, we have investigated three different tests and the two different versions of test levels via simulations. The first test is that used in connection with the (2.2) illustration, using

Q_{a+1,a+b} = Σ_{i=a+1}^{a+b} (Y_i − Ȳ)²

with a χ²_{b−1} null distribution. The second test uses the regression slope coefficient from a regression model; on the segment 1, . . . , τ we consider b̂ = Σ_{i=1}^{τ} (i − ī) y_i / M(τ), where we know that M(τ) b̂² ~ χ²_1 under the homogeneity hypothesis; here M(τ) = Σ_{i≤τ} (i − ī)² and ī is the average of 1, . . . , τ. The test for the segment τ + 1, . . . , n is similar. The third test uses monitoring bridges from Hjort & Koning (2002), as presented in Section 4.1. For this model the monitoring processes become

M_L(t) = (1/τ^{1/2}) Σ_{i ≤ [τt]} (Y_i − ξ̂_L)   and   M_R(t) = (1/(n − τ)^{1/2}) Σ_{τ+1 ≤ i ≤ τ+1+(n−τ−1)t} (Y_i − ξ̂_R)

to the left and to the right of τ, respectively. From these processes we use V_L = max_t |M_L(t)| and V_R = max_t |M_R(t)| as test statistics, with the theory from Hjort & Koning (2002) implying that these are asymptotically distributed as maxima of Brownian bridges; cf. (4.2).
method   50% coverage   50% size        90% coverage   90% size         95% coverage   95% size
A-I      0.49  0.50     34.01  17.19    0.88  0.88     112.72  66.28    0.94  0.95     135.86  86.13
A-II     0.52  0.51     10.30   8.91    0.90  0.90      33.64  23.62    0.94  0.95      44.96  28.72
A-III    0.60  0.58     10.53   9.28    0.93  0.93      28.88  23.43    0.97  0.96      37.57  27.81
B        0.50  0.51      2.47   2.11    0.90  0.90      10.59   7.97    0.95  0.95      14.60  10.51

Table 5.1. Coverage and mean size of confidence sets produced with method A with three different tests (and test level depending on τ) and with method B, applied to the normal model with known variance, with n = 200. Test A-I is the simple test, A-II is the regression test and A-III is the Hjort–Koning test. The leftmost number in each column is from datasets with τ = 25, the rightmost from datasets with τ = 100. Each number is based on 10³ simulated datasets.
The simulations were carried out by generating datasets of size n = 200. We
examined different combinations of position of τ , confidence levels, and difference
between the left and right levels. Here we briefly report on the cases where the τ positions were set to 25, 50, 75, 100 (the cases 175, 150, 125 are fully symmetric with 25, 50, 75), and with ξ_L = 2.2 and ξ_R = 3.3, indicating a difference not easy to tell immediately from the data. In order to evaluate the six different combinations
of tests and test levels, the coverage and size (number of τ values, rather than the range from smallest to largest value) of the confidence sets of level 0.50, 0.90 and 0.95 were recorded. One method is considered superior (more powerful) to another if it produces slimmer confidence sets while keeping the correct coverage. Method A with tests 1 and 2 has the correct coverage probability, by construction, and this is reflected in the simulations (see Table 5.1). The third test (Hjort–Koning) is based on an asymptotic result, however, and therefore does not have exactly the right coverage. The simulations reveal that the deviation is generally small; for example, a 95% confidence set typically covers the true τ value 97% of the time. Tests 2 and 3 produce confidence sets of very similar size, but the first test is clearly less powerful than the other two. For example, while tests 2 and 3 produce confidence sets with a mean size of 29 and 28 at the 95% level for τ = 100, the first test has confidence sets of mean size 86. When it comes to the choice of test level, it is
beneficial to let the level depend on τ, in the manner of α^{τ/n} · α^{1−τ/n}, rather than using √α · √α in (2.1), but the differences between these two alternatives tend to be small; in Table 5.1 we therefore include only the first choice. For all methods the resulting confidence sets are smaller for τ values near the middle of the data (close to 100). The opposite effect is most obvious for the datasets with τ = 25, where the confidence sets typically are close to 1.5 times larger than the confidence sets from data with τ = 100. The results for datasets with τ equal to 50 and 75 are not shown here, but have been seen to be fairly close to the results for τ = 100. For the good performance of method B, see Section 5.2.
We also investigated the behaviour of the different versions of method A when
datasets without change-points were generated. In these cases, the method produces
extremely wide confidence sets, generally spanning nearly the entire set of possible
τ values, thus indicating, as they should, that the data are homogeneous on both
sides of nearly all possible choices of τ .
Further examined were the two different tests for method A for the model where the variance is unknown (and potentially different on the two segments); y_i ~ N(ξ_L, σ²_L) to the left of τ and y_i ~ N(ξ_R, σ²_R) to the right. The first test is an extension of the regression based test above. Writing down the required formulae for the full segment 1, . . . , n (and then applying these for the left and right segments later on), we have b̂ = Σ (i − ī) y_i / M with M = Σ_{i=1}^{n} (i − ī)², and employ t = M^{1/2} b̂ / σ̂, where σ̂² = Σ_{i=1}^{n} {y_i − ȳ − b̂(i − ī)}² / (n − 2). Here ī is the average of the indexes employed. Under homogeneity, t ~ t_{n−2}. The second test is an application of the monitoring bridges from Hjort & Koning (2002). This time we have two unknown parameters and thus the monitoring process is two-dimensional. Following the recipe for these monitoring processes, we have to the left of τ

M_L(t) = τ^{−1/2} Σ_{i ≤ [τt]} ( Z_i, (Z_i² − 1)/√2 )^t   for 0 ≤ t ≤ 1,

with Z_i = (Y_i − ξ̂_L)/σ̂_L, and as the test statistic we use

V_L = max{ max_{t≤1} |M_{L,1}(t)|, max_{t≤1} |M_{L,2}(t)| },

the maximum of the absolute maxima of the two bridges. The asymptotic distribution of V_L (under homogeneity) can be easily computed via H(z)² with H(z) from (4.2). We construct a similar test statistic for the segment to the right of τ.
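As an illustration of how the one-segment statistic can be computed, here is a small sketch (our own code, not from the paper), evaluating the two bridge components at the jump points t = i/m:

```python
import numpy as np

def bridge_statistic(segment):
    """Sketch of the two-dimensional monitoring-bridge statistic for one
    segment, normal model with unknown mean and variance: V is the
    maximum of the absolute maxima of the two component bridges,
    evaluated at the jump points t = i/m."""
    m = len(segment)
    z = (segment - segment.mean()) / segment.std()   # ML scaling (ddof=0)
    comp1 = np.cumsum(z) / np.sqrt(m)                # bridge from Z_i
    comp2 = np.cumsum((z ** 2 - 1) / np.sqrt(2)) / np.sqrt(m)
    return max(np.abs(comp1).max(), np.abs(comp2).max())

rng = np.random.default_rng(0)
v_homog = bridge_statistic(rng.normal(2.2, 1.0, 100))
```

The first component reacts to drifting means, the second to drifting variances, which is what makes this test sensitive to both kinds of change.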
Again we generated datasets of size n = 200 with four different τ values (25, 50, 75, 100), and again we recorded the coverage and size (number of τ values) of the confidence sets of level 0.50, 0.90 and 0.95. We studied the two tests in two different settings, one where the change-point is a change in the mean, with ξ_L = 2.2, ξ_R = 3.3 and σ_L = σ_R = 1, and the other where it is a change in the variance level, with ξ_L = ξ_R = 2.2, σ_L = 1 and σ_R = 2.
method   50% coverage   50% size        90% coverage   90% size          95% coverage   95% size
A-I      0.49  0.52      6.32   6.14    0.92  0.90      28.44   20.53    0.96  0.96      58.03   34.03
A-II     0.62  0.60     14.88  11.92    0.94  0.94      43.03   28.65    0.97  0.98      40.97   26.40
B        0.49  0.51      2.73   2.19    0.86  0.89      12.72    8.57    0.92  0.95      18.00   11.36
A-I      0.50  0.52     99.27  99.08    0.92  0.90     177.26  176.86    0.96  0.95     186.76  185.93
A-II     0.63  0.64     35.24  16.59    0.94  0.94     130.31   37.57    0.97  0.98     155.42   43.82
B        0.46  0.49      4.79   3.24    0.85  0.90      21.59   12.28    0.89  0.96      30.95   16.28

Table 5.2. Coverage and mean size of confidence sets produced with method A with two different tests (and test level depending on τ) and with method B, applied to the normal model with unknown variance. The first three rows concern the case where the change-point is a change in the mean, and the last three the case where it is a change in the variance. Test A-I is the regression test and test A-II is the Hjort–Koning test. The leftmost number in each column is from datasets with τ = 25, the rightmost from datasets with τ = 100. Each number is based on 10³ simulated datasets.
For datasets with a change in the mean, the regression-based test performed slightly better than the Hjort–Koning test, having the correct coverage and narrower confidence sets (see Table 5.2). However, the regression-based test is only constructed to discover changes in the mean levels; the Hjort–Koning test is therefore a more flexible test, able to discover change-points also when the change only affects the variance (see Table 5.2, lower part). For both settings, the resulting confidence sets were much larger for datasets with τ = 25. This was most apparent for the Hjort–Koning test in the second setting (change in the variance), where the mean size of the confidence sets increased from around 44 at the 95% level (for τ equal to 50, 75 or 100) to 155 for τ = 25.
5.2. Method A versus Method B. The two methods proposed in this article have
similar aims, but different points of departure and different performances. While
method B assumes a model where there is a change-point (exactly one) on the whole
data segment, method A considers possible τ values as points where the data on each
side of τ are deemed homogeneous. The performance of method A is thus mostly
dependent on the power of the chosen test in discovering lack of homogeneity. We
included method B in the three simulation studies described above, and they reveal
that method B produces clearly smaller confidence sets compared to the different
versions of method A we have included. However, method B has a tendency to
produce confidence sets with slightly lower coverage than the specified level (and the
confidence sets should therefore be larger). The coverage problem is more apparent
when the change-point is far from the centre of the data, especially for the more
complex model with unknown variance (see Table 5.2). Method B seems nonetheless
to outperform A in these simulations. We still consider method A to be fruitful, however, with a higher degree of flexibility in the choice of test, and more readily applicable to complicated high-dimensional or even nonparametric situations.
5.3. The Bayesian approach. Bayesian solutions to the change-point problem are not hard to set up, but they of course require a prior for (τ, θ_L, θ_R), sometimes with ad hoc constructions. This leads to a posterior distribution for τ. Suppose in particular that τ is given the prior π_0(τ), independently of priors π_L and π_R for θ_L and θ_R. This leads to the posterior distribution

π(τ | data) ∝ π_0(τ) λ_L(τ) λ_R(τ),

expressed via the marginal left and right likelihoods

λ_L(τ) = ∫ L_L(θ_L) π_L(θ_L) dθ_L   and   λ_R(τ) = ∫ L_R(θ_R) π_R(θ_R) dθ_R.

These can be computed explicitly in a few models, and lead to a clear Bayesian posterior for τ. Via numerical integration methods or MCMC one may also compute such a π(τ | data) in a range of other situations, even without clear formulae for the marginal likelihoods; see Carlin et al. (1992) and Fearnhead (2006).
Useful approximations emerge via the following Taylor expansion arguments, which we first put up for the case of n observations from the same model with a prior π(·) for the same θ parameter, of dimension p:

λ = ∫ exp{ℓ_n(θ)} π(θ) dθ
  ≈ ∫ exp{ℓ_max − ½ (θ − θ̂)^t n Ĵ (θ − θ̂)} π(θ) dθ
  ≈ exp(ℓ_max) |n Ĵ|^{−1/2} π(θ̂) (2π)^{p/2}.

Here θ̂ is the ML estimator and Ĵ = −n^{−1} ∂²ℓ(θ̂)/∂θ ∂θ^t the normalised Hessian matrix, converging with increasing n to a certain matrix. This leads to log λ = ℓ_max − ½ p log n + O_pr(p), akin to the approximation leading to the Bayesian information criterion BIC (see Claeskens & Hjort, 2008, Ch. 4).
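The quality of this Laplace-type approximation is easy to check numerically. The sketch below (our own choice of model and prior, not from the paper) uses the conjugate model y_i ~ N(θ, 1) with θ ~ N(0, v), where the marginal likelihood λ is also available exactly:

```python
import numpy as np
from scipy.stats import norm, multivariate_normal

# Numerical check of log(lambda) ~= l_max - (1/2) p log n
#   + log pi(theta_hat) + (p/2) log(2 pi),  with p = 1 here.
rng = np.random.default_rng(3)
n, v = 100, 4.0
y = rng.normal(1.3, 1.0, n)

# Exact log marginal likelihood: y ~ N_n(0, I + v * 1 1^t).
cov = np.eye(n) + v * np.ones((n, n))
log_lam_exact = multivariate_normal.logpdf(y, mean=np.zeros(n), cov=cov)

# Laplace approximation, with theta_hat = ybar and J_hat = 1 for this model.
theta_hat = y.mean()
l_max = norm.logpdf(y, loc=theta_hat).sum()
log_lam_laplace = (l_max - 0.5 * np.log(n)
                   + norm.logpdf(theta_hat, scale=np.sqrt(v))
                   + 0.5 * np.log(2 * np.pi))
```

For this Gaussian case the only error comes from replacing π(θ) by π(θ̂), so the two log-values agree to a few decimals already at n = 100.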
Now going back to the change-point analysis, and keeping the leading terms only, we are led to the approximation

π(τ | data) ∝ π_0(τ) exp[ℓ_prof(τ) − ½ p log{τ(n − τ)}]
           = π_0(τ) exp{ℓ_prof(τ)} {τ(n − τ)}^{−p/2}.     (5.1)

This assumes that the left and right priors for θ_L and θ_R are not overly different. At any rate, the sizes of the leading terms of log π(τ | data) are O_pr(n) and O(p log τ + p log(n − τ)), with remainder terms of size O_pr(p). The approximation is useful for the computational side of things, as it bypasses the need for high-dimensional integration or for MCMC setups, but also for shedding light on the behaviour of the posterior distribution and on how it differs from the frequentist approaches we are developing and advocating in the present paper. We learn e.g. that the Bayesian posterior has a tendency to push τ towards the extreme ends. The (5.1) formula is incidentally exactly correct for the case of the model N_p(ξ_L, I_p) to the left and N_p(ξ_R, I_p) to the right, and with flat priors for ξ_L and ξ_R.
For quantities associated with smooth parametric models one is used to the phenomenon described via so-called Bernshteĭn–von Mises theorems, that Bayesian and frequentist inference tend to agree well, with the prior in question being reasonably quickly washed out by the data provided the parameter dimension is low; see e.g. the discussion in Hjort et al. (2010, Introduction). This is different here, however, in view of (5.1) and its consequence

log π(τ | data) = log π_0(τ) + ℓ_prof(τ) − ½ p{log τ + log(n − τ)} + O_pr(1),

which shows both that there is a certain bias inherent in the Bayes construction and that this bias disappears more slowly with increasing sample size than in regular parametric models. Simple simulation exercises reveal that the distribution of U_τ = Σ_{τ_B < τ} π(τ_B | Y) + ½ π(τ | Y), the half-corrected version of the cumulative posterior distribution for the Bayesian parameter τ_B, computed under the true τ, is often far from uniform, even for moderately large n. From such investigations,
along with those reported on earlier in this section, it is apparent that the confidence distribution method B, based on the deviance and its distribution, does a much better job than the Bayesian apparatus when it comes to delivering confidence intervals with correct coverage. The reason the Bayes method does badly in this regard is partly that there is an inherited bias of size O(p log τ + p log(n − τ)) in the log-posterior, but even more so that the distribution

π(τ | y) ∝ exp{ℓ_prof(τ)}

also often delivers inaccurate confidence, even for moderately large n in simple models. This is in contrast to how things pan out for parameters of smooth regular models, where such a recipe typically leads to accurate coverage with increasing sample size, as per the Bernshteĭn–von Mises theorems (here in the form of the Laplace type inverse probability, Bayes with a flat prior). In this connection see also Fraser (2011), who argues that Bayes is sometimes only 'quick and dirty confidence', and Efron (2015), who is concerned with the frequentist accuracy of Bayes solutions in a general perspective. We also mention that the ML estimator τ̂ typically does better than the Bayes estimator τ̂_B maximising the posterior distribution (i.e. the Bayes solution under a 0-1 loss function), as judged by e.g. mean absolute deviation, as seen via simulation experiments.
6. Application 1: British mining accidents
As a first, simple illustration, we apply method B of Section 3 to a dataset from
the change-point literature, the number of British coal-mining disasters from 1851
to 1962; see Jarrett (1979) for relevant background and for certain corrections that
were made to earlier accounts. With y_i the number of mining disasters in year i, we take these to be independent and Poisson distributed with parameter θ_L for i ≤ τ and θ_R for i ≥ τ + 1. This is the model used for these data by Carlin et al. (1992), for a Bayesian analysis, where clear posterior distributions are found for the parameters based on their given prior for (τ, θ_L, θ_R). They also provided a posterior density for the relative change parameter ρ = θ_L/θ_R. In this case, our methods give very similar results to the above-mentioned Bayesian analysis, with our confidence distributions matching their posteriors, but without priors.
In order to compute the confidence curve for the break-point, we need the deviance function and the profile log-likelihood function, here taking the form

ℓ_prof(τ) = τ ȳ_L(τ){log ȳ_L(τ) − 1} + (n − τ) ȳ_R(τ){log ȳ_R(τ) − 1}.
The ML estimates are τ̂ = 41 (corresponding to year 1891), θ̂_L = 3.098 and θ̂_R = 0.901. From the estimates of θ_L and θ_R we simulated datasets under each possible change-point value (that is, for all years between 1851 and 1961), calculated the deviance functions and computed the confidence curve from recipe (3.3); this yields the left panel of Figure 6.1. The curve agrees with the posterior distribution for the change-point given in Carlin et al. (1992), where the posterior mode also agrees with the ML estimate.
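To make the simulation recipe concrete, the following sketch (our own illustration, run on synthetic counts generated with levels close to the fitted values, not on the actual disaster series) computes the deviance curve and, for each candidate τ, its simulated null distribution:

```python
import numpy as np

def xlogx(m):
    # m log m, with the convention 0 log 0 = 0
    return np.where(m > 0, m * np.log(np.where(m > 0, m, 1.0)), 0.0)

def profile_loglik(y, taus):
    """Poisson profile log-likelihood l_prof(tau) (up to a constant),
    vectorised over the candidate change-points taus."""
    n, cs = len(y), np.cumsum(y)
    mL = cs[taus - 1] / taus
    mR = (cs[-1] - cs[taus - 1]) / (n - taus)
    return taus * (xlogx(mL) - mL) + (n - taus) * (xlogx(mR) - mR)

def cc_method_B(y, taus, B=100, seed=1):
    """Sketch of method B: for each candidate tau, simulate the deviance
    D(tau, Y) under the fitted two-level Poisson model, and record the
    fraction of simulated deviances below the observed one."""
    rng = np.random.default_rng(seed)
    n = len(y)
    lp = profile_loglik(y, taus)
    D_obs = 2 * (lp.max() - lp)
    cc = np.empty(len(taus))
    for k, tau in enumerate(taus):
        mL, mR = y[:tau].mean(), y[tau:].mean()
        sims = np.empty(B)
        for b in range(B):
            ysim = np.concatenate([rng.poisson(mL, tau),
                                   rng.poisson(mR, n - tau)])
            lps = profile_loglik(ysim, taus)
            sims[b] = 2 * (lps.max() - lps[k])
        cc[k] = np.mean(sims <= D_obs[k])
    return D_obs, cc

rng = np.random.default_rng(42)
y = np.concatenate([rng.poisson(3.1, 41), rng.poisson(0.9, 71)])
taus = np.arange(5, 108)
D_obs, cc = cc_method_B(y, taus)
tau_hat = int(taus[np.argmin(D_obs)])
```

B is kept small here for speed; in practice one would use far more replications per candidate τ.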
Figure 6.1. Left panel: confidence curve for the change-point τ, using the deviance based method B. Right panel: confidence density for the degree of change ρ = θ_L/θ_R, via method B (full line), and the Bayesian method (dashed line).
In order to analyse the degree of change ρ (the ratio between the rates of disasters, for the past state of affairs and for the present), we reparametrise the model as y_i ~ Pois(ρθ) for i ≤ τ and y_i ~ Pois(θ) for i ≥ τ + 1. The log-likelihood function is

ℓ(τ, ρ, θ) = τ{−ρθ + ȳ_L log(ρθ)} + (n − τ){−θ + ȳ_R log θ},

which we then maximise over θ and τ to reach the profile log-likelihood function

ℓ_prof(ρ) = τ̂(ρ)[−ρ θ̂(ρ) + ȳ_L(τ̂(ρ)) log{ρ θ̂(ρ)}] + {n − τ̂(ρ)}[−θ̂(ρ) + ȳ_R(τ̂(ρ)) log θ̂(ρ)],

with θ̂(ρ, τ) = {τ ȳ_L + (n − τ) ȳ_R}/(τρ + n − τ) = n ȳ/(τρ + n − τ), and τ̂ obtained by maximising over all possible τ values. The ML estimate for the degree of change was
maximising over all possible τ values. The ML estimate for the degree of change was
3.437, and the confidence curve was obtained by (3.3) by simulating datasets from
20 CONFIDENCE FOR CHANGE-POINTS
a grid of ρ values, using the overall ML estimate for τ along with
b
θ
L
(ρ) and
b
θ
R
(ρ),
following the recipe of Section 3.4. The confidence curve cc(ρ, y) can be converted
to a cumulative confidence distribution C(ρ, y), via cc(ρ, y) = |1 2 C(ρ, y)|, which
then via numerical derivation yields a confidence density, say c(ρ, y), displayed in the
right panel of Figure 6.1. This may now be compared to the posterior density for ρ
arrived at with any reasonable start prior for (τ, θ
L
, θ
R
), e.g. from MCMC methods
presented in Carlin et al. (1992). Our prior-free method gives results very similar
to those of the Bayesian machinery, with the almost noninformative priors used by
Carlin et al. (1992). The right panel of Figure 6.1 displays two very similar curves;
the confidence density and the Bayesian posterior calculated using a flat prior for
τ and independent almost noninformative Gamma priors with parameters (
1
2
,
1
2
) for
the two levels.
The simulations required for constructing the confidence curve with method B can be time-consuming, but here we may resort to an approximate solution based on the Wilks theorem. If we fix τ at the ML value τ̂, and proceed with deviance calculus profiling over (θ_L, θ_R) subject to θ_L/θ_R = ρ, then D(ρ, Y) is very closely approximated by a χ²_1, leading to a confidence curve for ρ via cc(ρ, y_obs) = Γ_1(D(ρ, y_obs)), where Γ_1 is the cumulative distribution function of the χ²_1 distribution. The resulting confidence curve is indistinguishable from the one computed with simulations and displayed in the right panel of Figure 6.1, demonstrating that τ is sufficiently well estimated in this case.
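The χ²_1 recipe is easy to code. The sketch below (our own illustration, run on synthetic Poisson data with levels near the fitted values, not on the actual disaster counts) fixes τ at a given value and maps the profile deviance in ρ through the χ²_1 cdf:

```python
import numpy as np
from scipy.stats import chi2

def cc_rho(y, tau, rhos):
    """Wilks-based confidence curve for rho = theta_L/theta_R, with tau
    held fixed: cc(rho) = Gamma_1(D(rho, y)), Gamma_1 the chi2(1) cdf."""
    n = len(y)
    yL, yR = y[:tau].mean(), y[tau:].mean()
    # profile log-likelihood in rho, with theta profiled out in closed form
    lp = np.empty(len(rhos))
    for k, rho in enumerate(rhos):
        theta = n * y.mean() / (tau * rho + n - tau)
        lp[k] = (tau * (-rho * theta + yL * np.log(rho * theta))
                 + (n - tau) * (-theta + yR * np.log(theta)))
    # the exact maximum over rho is attained at rho_hat = yL/yR
    l_max = tau * (yL * np.log(yL) - yL) + (n - tau) * (yR * np.log(yR) - yR)
    return chi2.cdf(2 * (l_max - lp), df=1)

rng = np.random.default_rng(42)
y = np.concatenate([rng.poisson(3.1, 41), rng.poisson(0.9, 71)])
rhos = np.linspace(1.0, 8.0, 701)
cc = cc_rho(y, 41, rhos)
rho_grid_min = rhos[np.argmin(cc)]
```

The curve touches zero at ρ̂ = ȳ_L/ȳ_R and rises towards one on both sides, so {ρ : cc(ρ) ≤ α} is a level-α confidence interval.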
7. Application 2: Tirant lo Blanch
Our next change-point challenge concerns the Catalan novel Tirant lo Blanch.
This chivalric romance, written in the 1460s, can be considered the world's first
novel, and was incidentally much admired by Cervantes (who wrote the more famous
Don Quixote about 150 years later). Most scholars agree that the novel had two
authors; the first author Joanot Martorell died before the completion of the novel,
and Marti Joan de Galba claimed to have finished it. Hence there is a change-point
problem, where we should hunt for the chapter number where the change from the
first to the second author takes place. Earlier statistical analyses include Girón
et al. (2005), Riba & Ginebra (2005), Koziol (2014) and Chen & Zhang (2015).
Most researchers favouring the change-of-author hypothesis believe that the change takes place towards the end of the 487-chapter book, more precisely between chapters 350 and 400 (Chen & Zhang, 2015).
Different aspects of the chapters and the writing may be considered for statis-
tical measurements and then collected from the text. Analysing a quarrel between
Nobel Prize winners, Hjort (2007) used statistical modelling of sentence lengths to
discriminate between two literary corpora, for example, and in Section 10.2 we are
indeed using such information to assist us in pinpointing the author change-point.
Presently we are concentrating on the word lengths in each chapter, and we have
only considered the 425 chapters with more than 200 words. From these we collect
vectors y
i
of dimension 10, displaying the number of words of length 1, 2, 3 and so
on, up to the number of words equal to or longer than 10 letters. The aforementioned
authors have used the same dataset, and all, except for Chen & Zhang (2015), model
the 425 word count vectors as multinomially distributed. Chen & Zhang (2015) pro-
pose a graph based, nonparametric change-point method. Gir´on et al. (2005) adopt
a Bayesian framework and provide a posterior distribution for the change-point τ.
Similar models as in Gir´on et al. (2005) are assumed in Riba & Ginebra (2005),
but in a frequentist framework and without providing any uncertainty around the
change-point estimates. Koziol (2014) approaches the change-point problem with
Lancaster partitions of chi-squared tests of homogeneity.
Initial goodness-of-fit checks demonstrate that the word lengths in the different chapters of the book have heterogeneous distributions; in particular, the pure multinomial model favoured by several previous scholars, with fixed probabilities of word lengths from chapter to chapter inside a segment, does not fit well, as it allows for too little variability between chapters. We therefore investigated three other models: an overdispersed multinomial, that is, the Dirichlet-multinomial distribution, and two different multinormal models. The first of these allows the change-point to affect both the mean vectors and the covariance matrices, while the second assumes that the authors differ in the mean vector only. To judge between candidate models we have computed values of the Akaike information criterion, cf. Claeskens & Hjort (2008, Ch. 3), defined as AIC = 2 ℓ_max − 2 dim, with dim the number of parameters estimated in the model and ℓ_max the associated maximum of the log-likelihood function (see Table 7.1). These AIC values give a clear indication that the multinormal model fits the data better, and we thus used the multinormal model for the construction of the confidence curve for the change-point.
model                   ℓ_max     dim    AIC
multinomial             −14,449    19    −28,936
Dirichlet-multinomial   −13,870    21    −27,780
multinormal 1           −13,722   109    −27,660
multinormal 2           −13,771    64    −27,671

Table 7.1. Maximised log-likelihood, number of parameters and AIC value for the different models: multinormal 1 is the model assuming that the two authors have both different mean vectors and different covariance matrices, while multinormal 2 assumes that the authors differ only in the mean vector.
The multinormal model assumes that the observed proportions z_i = y_i/m_i in each chapter follow a multinormal distribution, with precision related to the sample size. Disregarding element no. 10, since the proportions sum to one for each chapter, the model used is

z_i ~ N_9(ξ_L, Σ_L/m_i) for i ≤ τ   and   z_i ~ N_9(ξ_R, Σ_R/m_i) for i ≥ τ + 1.

Here ξ_L and ξ_R are the mean vectors of these distributions of proportions, and Σ_L and Σ_R appropriate 9 × 9 covariance matrices. The confidence curve was obtained
by method B (Section 3). First we find the profile log-likelihood function, which in generalisation of the result (3.5) to the present case with variance matrices Σ/m_i becomes

ℓ_prof(τ) = −½ τ log |Σ̂_L(τ)| − ½ (n − τ) log |Σ̂_R(τ)|,

now with

Σ̂_L(τ) = (1/τ) Σ_{i≤τ} m_i {z_i − ξ̂_L(τ)}{z_i − ξ̂_L(τ)}^t,
Σ̂_R(τ) = (1/(n − τ)) Σ_{i≥τ+1} m_i {z_i − ξ̂_R(τ)}{z_i − ξ̂_R(τ)}^t,

where ξ̂_L(τ) = Σ_{i≤τ} m_i z_i / Σ_{i≤τ} m_i and similarly for ξ̂_R(τ). There is a consequent
formula for the deviance function for τ. The ML estimate for the change-point was found to be τ̂ = 320. In the original numbering of the chapters this corresponds to chapter 371, which is the same point estimate as with the ordinary multinomial model in Riba & Ginebra (2005), and also the mode of the change-point posterior distribution in Girón et al. (2005). The multinormal model gave the following ML estimates for the mean of the word length proportions (the covariance matrices are not given):
word length    1      2      3      4      5      6      7      8      9      10
left         0.106  0.222  0.209  0.103  0.105  0.104  0.053  0.045  0.029  0.024
right        0.114  0.209  0.190  0.098  0.103  0.105  0.058  0.050  0.038  0.035

Table 7.2. Estimated proportions of words of different lengths, before and after the estimated change-point τ̂ = 320, via the multinormal model.
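The weighted multinormal profile log-likelihood above is straightforward to implement. The sketch below (our own code, demonstrated on synthetic three-dimensional proportion data rather than the Tirant counts) recovers a planted change in the mean vector:

```python
import numpy as np

def lprof_multinormal(z, m, tau):
    """Sketch of the weighted multinormal profile log-likelihood:
    -(1/2) tau log|Sig_L(tau)| - (1/2)(n - tau) log|Sig_R(tau)|, with
    weighted segment means and covariance estimates as in the text.
    z is the (n, d) array of proportions, m the (n,) chapter sizes."""
    def log_det(zs, ms):
        xi = (ms[:, None] * zs).sum(axis=0) / ms.sum()
        r = zs - xi
        Sig = (ms[:, None, None] * r[:, :, None] * r[:, None, :]).sum(axis=0) / len(zs)
        return np.linalg.slogdet(Sig)[1]
    n = len(z)
    return (-0.5 * tau * log_det(z[:tau], m[:tau])
            - 0.5 * (n - tau) * log_det(z[tau:], m[tau:]))

# Synthetic demonstration (not the Tirant data): d = 3 proportions,
# change in the mean vector at tau = 80 of n = 120 chapters.
rng = np.random.default_rng(2)
n, d, tau_true = 120, 3, 80
m = rng.integers(200, 1200, n)
xiL, xiR = np.array([0.30, 0.40, 0.30]), np.array([0.35, 0.35, 0.30])
z = np.vstack([rng.multivariate_normal(xiL if i < tau_true else xiR,
                                       0.5 * np.eye(d) / m[i])
               for i in range(n)])
taus = np.arange(10, n - 10)
lp = np.array([lprof_multinormal(z, m, t) for t in taus])
tau_hat = int(taus[np.argmax(lp)])
```

Maximising this profile over τ gives the ML change-point; method B then proceeds by simulating the deviance 2{ℓ_prof(τ̂) − ℓ_prof(τ)} under each candidate τ.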
By simulating the distribution of D(τ, Y) we obtain the confidence curve for τ, shown in Figure 7.1. Interestingly, the curve indicates some confidence in τ = 295, which corresponds to chapter 345, in accordance with the Bayesian posterior distribution in Girón et al. (2005) and with some of the analyses based on summary measures presented below.
Figure 7.1. Confidence curve for the change-point τ, using method B based on the multinormal model.
In addition to modelling the whole vector of proportions we can look at different summary measures for each chapter, for example the average word length per chapter. This was also used in Riba & Ginebra (2005), where they assumed a normal model for the average word length per chapter and for the value taken by the first principal component from correspondence analysis, yielding point estimates corresponding to chapters 345 and 371, respectively. We also consider the average word length, and in addition the standard deviation of word lengths, the proportion of words of length 3 or less, and the proportion of words of length at least 8 letters, in each chapter. The last two summary statistics are motivated by the fact that the change in author seems to be mostly reflected in the proportions of short and long words, cf. Table 7.2. Each of these summary measures, say w_i, can be modelled as

w_i ~ N(θ_L, σ²_L/m_i) for i ≤ τ,   and   w_i ~ N(θ_R, σ²_R/m_i) for i ≥ τ + 1,
and confidence curves can easily be constructed by method B; see Figure 7.2.
Three of the summary measures give the most confidence to τ̂ = 295, corresponding to chapter 345, but all of these curves also place some confidence on the change-point taking place in chapter 371.

Figure 7.2. Confidence curves for the change-point τ, using method B: full line, based on the average word length per chapter; dashed line, based on the standard deviation in word lengths; dotted line, based on the proportions of words of length 3 letters or less; and dot-dashed line, based on the proportions of words of length 8 or more.

When looking at the proportions of short words (of length 1, 2 or 3 letters) in each chapter, the most confidence is however placed on τ̂ = 320, corresponding to chapter 371, consistent with the multinormal
analysis above. All our analyses for Tirant lo Blanch indicate that the change of
authors takes place towards the end of the book, with the most confidence placed
on the chapters 345 and 371. These results are consistent with previous statistical
analyses of the work (Riba & Ginebra, 2005; Girón et al., 2005; Koziol, 2014; Chen
& Zhang, 2015), with aspects of literary analyses (Rosenthal, 1984, preface), and
with the assertion made by the second author himself, who, in the afterword of the
book, writes that he completed the final quarter of the book.
8. Application 3: skiing days at Bjørnholt
The number of skiing days in a winter season is defined as the number of days with at least 25 cm of snow. (The definition and the term 'skiing day' were introduced by the Norwegian meteorologist Gustav Bjørbæk, 25 cm being the least amount of snow needed to avoid injury in case of a fall.) In Figure 8.1 the number of such days at the particular location of Bjørnholt, in Oslo's skiing and recreation area Nordmarka, is plotted for the winter seasons 1896-97 to 2014-15. The expected number of skiing days and the future prospects for snowy winters are of especially great interest to skiing enthusiasts. Moreover, these numbers are good indicators of how cold winters are and provide indications of the general temperature trend over a given period of time. This suggests that joint analyses of such skiing-day time series form yet another potential source for studying climate change.
Figure 8.1. The number of skiing days for the winter seasons 1896-97 to 2013-14 at Bjørnholt. The imputed data points are meteorologists' reconstructions making use of nearby locations. The global linear trend (dashed line) decreases with an estimated slope of about −0.40. Some time after 1960 there appears to be a structural change in the series. The estimated year for the change-point of the connected linear model (full line) is 1977, where the relative slope changes from a modest −0.14 to a dramatic −1.22.
Letting Y_t be the number of skiing days for year t, we consider change-point models of the type

Y_t = m(β_L, t) + ε_t for t ≤ τ   and   Y_t = m(β_R, t) + ε_t for t ≥ τ + 1,   (8.1)

with m(β, t) being suitable trend functions (here taken constant or linear), and where {ε_t} is an autoregressive time series model of order one, i.e. an AR(1). The latter is defined via the representation ε_t = ρ ε_{t−1} + σ δ_t, with the δ_t being independent and standard normal. Some analysis suggests an AR(1) captures the essential dependency structure here, with higher order autoregressions leading to overfitting. For our analyses we do use the full data sequence, but to avoid instability in the estimated model at the edges we only consider change-point candidates τ ∈ {1907, . . . , 2004}.
We will actually go through and briefly compare four different specialisations of the model above. The first three take an unchanged AR(1) process for the ε_t but three different trend functions, each with a change-point: (i) constant, where m(β_L, t) = β_L and m(β_R, t) = β_R; (ii) linear, using disconnected linear regression models of time, one to the left and one to the right of τ; and (iii) connected linear, meaning two separate linear models with the additional restriction of continuity, i.e. using trend a_L + b_L t to the left and a_R + b_R t to the right, but with a_L + b_L(τ + ½) = a_R + b_R(τ + ½). Model (iv) uses a common linear a + bt trend across the 118 winters but allows the σ associated with the ε_t of (8.1) to jump from some σ_L to some σ_R. The numbers of unknown parameters for these four models, including the change-point τ itself, are 5, 7, 6, 5, respectively.
Figure 8.2. The confidence sets for the change-point τ (left panel) and the confidence curves for the two slopes (right panel) in the connected linear model. The estimated change-point for the connected linear model (with its six parameters) is 1977, where the mean slope changes from −0.14 to −1.22.
Our intention here is not to go into a detailed analysis of the underlying meteo-
rological phenomena. Instead we aim at demonstrating the usefulness of our general
method B of Section 3 for reaching confidence distributions, within the framework
of change-point models with dependent errors. The main focus is on τ, but we also
take an interest in the effect a change-point has on the estimated slopes. Method B
of Section 3 yields confidence inference for τ and for degree of change parameters,
within each of the four models just described.
The data in Figure 8.1 appear to indicate either a strong decreasing trend, or
a change in the structure of the underlying model, perhaps some time after 1960.
Our model (i), which has a change in a constant mean, finds via ML that the most likely year for a change is 1988, with δ̂ = β̂_L − β̂_R = 45.65. In words, everything is stable until 1988, then there is a massive drop, from 138.00 to 92.35, in the
Figure 8.3. The confidence curve (left panel) suggests three possible locations for a change in σ, viz. 1915, 1970 and 1988, with the latter given most confidence. The confidence curve for σ_R/σ_L (right panel) indicates that the σ of the AR(1) part of (8.1) has increased, around 1988, with a factor of about 1.61.
expected number of skiing days per winter. Then consider model (ii), with two separate linear trends. The mean slope for the first part is approximately zero, with β̂_L = (139.84, 0.04) given as (intercept, slope). Then there is a sudden drop and change in the expectation at the break-point τ̂ = 1988, now with a steady increase thereafter, with β̂_R = (131.57, 2.12), almost returning to the pre-1988 level, with an expected 118 and 120 skiing days for the 2013-14 and 2014-15 winter
seasons. The abrupt changes found when analysing these two models do not match
prior meteorological conceptions well, and indicate overfitting. For these reasons
we prefer models (iii) and (iv). Figure 8.2 pertains to change-point analysis within
model (iii), with connected linear trends, displaying a confidence curve for τ, with
point estimate 1977, and confidence curves for the (negative) slope parameters for
the trend before and after the break point.
At the outset it is by no means obvious that the heterogeneity seen in the data
(interpreted in a wide sense) is a result of a change in mean structure. The apparent
change of behaviour could potentially be caused by a sudden change in either de-
pendence (i.e. the ρ parameter), the variability (i.e. the σ), or both. Investigations
via method B do not provide any evidence of a change in the correlation struc-
ture. There is however some evidence that the standard deviation σ is not constant
across years, see Figure 8.3. This model (iv) suggests that there is a change in σ around τ̂ = 1988. The estimated parameters are β̂ = (147.5, −0.27), ρ̂ = 0.31, (σ̂_L, σ̂_R) = (30.52, 49.15).
9. Application 4: the Hjort time series 1859-2012
As an illustration of our general method A of Section 2, we apply the new
monitoring bridge plots from Section 4.2 to first test for full homogeneity of a long
and prominent time series from fisheries sciences, and then to look for a regime shift.
The time series in question is the Hjort liver quality index time series for the skrei,
the Northeast Arctic cod. In marine biology this hepatosomatic index (HSI) is used
as a measure or indicator for the ‘quality of fish’ in a certain population, and then
typically studied as a time series; see Figure 9.1 (left panel). The index (in so-called
bulk form) may be represented as

    HSI = 100 × (total amount of liver) / (total amount of fish) = 100 × Σ_i x_i / Σ_i y_i,   (9.1)
where (x_i, y_i) represent the weight of the liver and the total weight of fish number i in one or several catches of fish; in the Lofoten fishery tens of millions of fish are landed each year.
are landed each year. The study of the liver quality index for the skrei goes back to
Hjort (1914), where such measurements for the time period 1880–1912 were recorded
and analysed, as part of his seminal work on the population dynamics underlying
the fluctuations of the great fisheries. The series has since then been extended both
forwards and backwards in time, to 1859–2012, yielding one of the longest time series
of marine science; see Kjesbu et al. (2014) and Hermansen et al. (2016).
The underlying dynamics and evolution of such series are of great importance
in marine biology. Studies of how the HSI evolves over time and interacts with and is influenced by associated factors include Kjesbu et al. (2014); Vasilakopoulos &
Marshall (2015); Hermansen et al. (2016). Here we focus on a subset of this long
time series, namely the years 1921–2012, where also the detailed monthly average
temperatures for Kola are available, see Boitsov et al. (2012). From these monthly
averages the average winter temperature can be constructed, averaging the monthly
means from the start of October (previous year) to start of March (current year).
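Spelled out as code, with the months covered read as October through February ('start of October' to 'start of March'); the dictionary-based interface below is our own illustrative choice, not the format of the Boitsov et al. (2012) data.

```python
def winter_mean(monthly, year):
    """Average winter temperature for a given year: the mean of the monthly
    means from October of the previous year through February of that year.

    `monthly` maps (year, month) pairs to monthly mean temperatures."""
    months = [(year - 1, m) for m in (10, 11, 12)] + [(year, m) for m in (1, 2)]
    vals = [monthly[ym] for ym in months]
    return sum(vals) / len(vals)
```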
Letting Y_i be the HSI for year i, consider the model where

    Y_i = β_0 + β_kola x_{i−1} + ε_i,   (9.2)
with i = 1, . . . , 90 representing the years 1922–2012, and where {ε_i} is an autoregressive process of order one and x_{i−1} is the winter Kola temperature for the previous year (checks suggest that there is no real model fit improvement using a higher-order autoregressive model). Several tests indicate that last year's winter average temperature carries more relevant information for the present value of the HSI than does the same year's winter temperature; this also matches biological arguments, see Hermansen et al. (2016) for additional discussion. Model (9.2) is quite simple and is
Figure 9.1. Left panel: The Hjort liver quality index (HSI) series for 1921–2012 (grey), along with average Kola winter temperature (black, in degrees Celsius). The vertical lines indicate the range (1927–2006) where we are searching for a potential change; the boundary points are excluded due to instability of the methods at the edges. Right panel: The monitoring bridge plot, reaching a maximum value of 1.67, with a corresponding p-value less than 0.01, suggesting a structural change in the model around 1987.
not meant to fully represent all the complex processes in the ocean influencing the
HSI index. The goal here is to illustrate our regime shift assessment methodology.
We use the theory of Section 4.2 to compute the monitoring bridge plot for the
HSI model (9.2), see Figure 9.1 (right panel). It indicates that the model is not
sufficient for describing the underlying mechanism generating the full time series.
The shape of the plot also suggests the existence of a regime shift. We shall search
for such a change-point, here using the general method A of Section 2 to construct
confidence sets for the location of this potential change. In short, the strategy is to
test for homogeneity to the left and to the right of each candidate point τ , using our
bridge plots. We do utilise the full data sequence in our analysis, but exclude the
first and last ten years from the list of candidate values for τ , which we hence take
as 1932, . . . , 2002. The resulting confidence curves are presented in Figure 9.2.
Our monitoring bridge tools are constructed to test the suitability of a model.
A structural break should therefore be interpreted as indicating that the underlying
model changes from one regime to another. Other terms used in marine science and
biology include ‘state shift’ and ‘critical transition’. A regime shift is characterised
by “relatively rapid change (occurring within a year or two) from one decadal-scale
period of a persistent state (regime) to another decadal-scale period of a persis-
tent state (regime)”; see King (2005) and also Brander (2010); Vasilakopoulos &
Marshall (2015). Also, note that the underlying framework for our general method
A assumes that the observations to the left and the right are independent of each
other, as per (2.1); within a segment, however, the observations may be strongly
dependent without violating the underlying assumptions of the method. For the
time series framework there is not strict independence between goodness-of-fit sta-
tistics computed to the left and the right of a given τ; the dependency is however not
strong here (the first-order autoregressive model seems to capture most of the structure),
and such a mild deviation from the underlying assumptions does not invalidate the
results using these versions of method A. A conservative Bonferroni correction, as
spelled out at the end of Section 2, yields a fairly similar confidence curve, for all
confidence levels above 0.60.
Figure 9.2. Left panel: The confidence curve for a regime shift τ obtained via method A, with absolute maxima of monitoring log-likelihood bridges, as in Section 4.2. The curve indicates two plausible regions for τ: one right before 1980 (which may be related to a decrease in variance, as suggested in Figure 9.1), and the second around 1991 (perhaps a change in the relationship between the HSI and the Kola winter temperature). Right panel: Estimates of HSI using the previous year's winter temperature, via (9.2), before and after the estimated regime shift.
We point out that a similar study, also involving the Kola winter temperature,
is given in Hermansen et al. (2016), and that an investigation of structural breaks
for this series is also conducted by Kjesbu et al. (2014); these studies tentatively
identify a potential departure in the pattern connecting Kola temperature and HSI
    model    β̂_0           β̂_kola        ρ̂             σ̂
    global   4.85 (0.85)   0.29 (0.16)   0.86 (0.06)   0.68 (0.05)
    left     5.09 (0.77)   0.37 (0.16)   0.78 (0.08)   0.62 (0.05)
    right    6.36 (2.26)   0.30 (0.48)   0.39 (0.27)   0.71 (0.11)

Table 9.1. Estimated parameters (with standard errors in parentheses) for the model defined in (9.2), using the complete series of observations, i.e. no change-point (global), and for the two subsets 1922–1991 (left) and 1992–2012 (right) corresponding to the estimated change-point between 1991 and 1992. The two most striking changes are the reversed influence of Kola temperature and the change in correlation after 1991.
in the beginning of the 1980s. Also, Vasilakopoulos & Marshall (2015) identified a regime
shift having taken place in 1981 using principal component analyses on 13 North-
East Atlantic cod population descriptors (including HSI) and 5 so-called stressors
(also including Kola temperature). According to these authors, the shift in the early
1980s was largely driven by the combined effect of low temperature, high mortality
rate and low stock size. Our methodology is capable not only of estimating the location of a potential change-point, but also of supplementing such estimates with a measure of uncertainty using confidence sets; such questions are not touched upon
in these other studies.
10. Concluding remarks
Below we offer a few concluding remarks, some pointing to further relevant
research questions.
10.1. Approximations and related approaches. For our general method B we
have relied on straight simulations to compute the required probabilities and confi-
dence curves, as with (3.3). This brute force method works well, but approximations
to the distributions of both the ML estimator bτ and the deviance statistic D(τ, Y )
can be worked with too; these are by necessity more complicated than the usual
results concerning limiting normality and chi-squared-ness of deviances valid for
continuous parameters of smooth parametric models. Such results have however the
potential to both speed up calculations of confidence curves and to yield additional
insights, also when it comes to comparing performances of different strategies. Meth-
ods initially worked with in Hinkley (1970), Hinkley & Hinkley (1970) and later on
by Cobb (1978), Worsley (1986) and other authors are relevant here, and have the
potential for being developed and finessed further. These lead in particular to cer-
tain approximations for the case where both τ and n − τ are large. Such envisioned
results ought also to shed more light on questions of performance and for theoretical
comparison of different confidence curve constructions.
Notably, Siegmund (1988) discusses the performance of several methods in a
single change-point setting. He starts by comparing five different methods for the
simple situation where the change-point τ is the only unknown parameter, i.e. when θ_L and θ_R are known. He also presents a method for the more general (and interesting) case, where θ_L and θ_R are unknown. The method produces exact confidence sets and is related to our method B. It can be re-written as a confidence curve, in our notation as

    cc(τ, y_obs) = P_τ{D(τ, Y ) < D(τ, y_obs) | θ̂_L(τ), θ̂_R(τ)}.   (10.1)
The method is restricted to models within the exponential family, where we have sufficient statistics for the θ parameters and where one thus obtains a probability depending only on τ by conditioning on the ML estimates θ̂_L(τ) and θ̂_R(τ) for each τ value. In practice this requires the user to simulate copies Y* of the dataset from the conditional distribution of Y | (θ̂_L(τ), θ̂_R(τ)), as opposed to our method B, where data are generated from f(y, θ̂_L) and f(y, θ̂_R), with the ML estimators θ̂_L = θ̂_L(τ̂) and θ̂_R = θ̂_R(τ̂). We have not yet undertaken a thorough comparison between our method B and Siegmund's method, but our initial investigations suggest that the two methods give very similar confidence curves in many cases. However, when either τ or n − τ is small, confidence sets from Siegmund's method appear to obtain more correct coverage than method B. This is not surprising, as our method relies on estimating the θ parameters sufficiently well. Contrary to our method B, Siegmund's method is restricted to the class of exponential family models and is also more difficult to use in some cases, as generating datasets from Y | (θ̂_L(τ), θ̂_R(τ)) can be complicated. Siegmund (1988) also provides approximations to the conditional probability in (10.1). Again these rely on both τ and n − τ being large, and remain yet to be compared with our methods.
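To make the contrast concrete, the following minimal sketch implements the simulation recipe of our method B for the toy case of a mean shift in unit-variance normal data, where the profile log-likelihood is available in closed form via segment means. All function names are ours; realistic applications (trends, AR(1) errors, unknown variances) require richer likelihoods.

```python
import numpy as np

rng = np.random.default_rng(2)

def profile_loglik(y, tau):
    """Profile log-likelihood (up to constants) of a mean shift at tau
    for N(mu, 1) data: both segment means are profiled out."""
    left, right = y[:tau], y[tau:]
    return -0.5 * (np.sum((left - left.mean())**2)
                   + np.sum((right - right.mean())**2))

def deviance(y, tau, taus):
    """D(tau, y) = 2 {l_prof(tau_hat) - l_prof(tau)} over the tau grid."""
    lls = [profile_loglik(y, t) for t in taus]
    return 2.0 * (max(lls) - profile_loglik(y, tau))

def method_b_cc(y_obs, taus, B=100):
    """Confidence curve cc(tau): the proportion of B simulated deviances,
    drawn with a break at tau and segment parameters fixed at the ML
    estimates from tau_hat, falling below the observed deviance."""
    n = len(y_obs)
    tau_hat = max(taus, key=lambda t: profile_loglik(y_obs, t))
    mu_L, mu_R = y_obs[:tau_hat].mean(), y_obs[tau_hat:].mean()
    cc = {}
    for tau in taus:
        d_obs = deviance(y_obs, tau, taus)
        d_sim = np.empty(B)
        for b in range(B):
            y_star = np.concatenate([rng.normal(mu_L, 1.0, tau),
                                     rng.normal(mu_R, 1.0, n - tau)])
            d_sim[b] = deviance(y_star, tau, taus)
        cc[tau] = float(np.mean(d_sim <= d_obs))
    return tau_hat, cc
```

Plotting cc against the candidate τ values yields confidence curves of the type shown in Figures 8.2 and 9.2.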
10.2. Combination of information. There are sometimes several sources of infor-
mation about a given change-point. In our analysis of Tirant lo Blanch in Section 7,
for example, we investigated how the change of authors is reflected in aspects of the
distribution of word lengths per chapter, such as the mean word length chapter
by chapter. There it is also worthwhile examining the sentence length distribution,
through the chapters, to see if a change of author style can be detected there. Via a
suitable R script operating on an electronic version of the Catalan 1490 manuscript
we have indeed gotten hold of the string of the manuscript’s 17593 sentence lengths.
The mean sentence lengths, chapter by chapter, can be modelled as normally distributed on both sides of τ with (possibly) different mean and variance parameters and with the variance depending on m′_i, the number of sentences in chapter i. The
mean word length and mean sentence length can be analysed separately by the meth-
ods developed in this paper, and as they can be considered independent sources of
information, their inference on τ may be combined. One potential strategy is to use
ideas related to combination of p-value functions in Liu et al. (2014), for a particu-
lar case involving discrete distributions, but there are better methods, as shown in
Cunen & Hjort (2015), Cunen & Hjort (2016). The parallel for the present case is
to stay with the log-likelihood profiles, naturally extending method B. Let ℓ_prof,1(τ) and ℓ_prof,2(τ) be the profile log-likelihood functions from information sources 1 and 2. These can be summed to ℓ_prof,comb(τ) = ℓ_prof,1(τ) + ℓ_prof,2(τ), from which we can find the combined maximum likelihood estimator τ̂ and construct the combined deviance function, say D_comb(τ, Y ) = 2{ℓ_prof,comb(τ̂) − ℓ_prof,comb(τ)}. Then we can simulate its distribution at each position τ by generating a large number of datasets Y_1 and Y_2
based on the first and second data source respectively. The result of the combination
varies from case to case; if the two sources produce the same τ estimate then the
combined confidence curve will also point to the same number, but with smaller confidence sets, reflecting the increase in information. If the two sources
have different estimates of τ , the combined confidence curve may give an estimate
between the two sources (a compromise), but it may also favour the estimate from
one source over the other. This is exactly what we observe with Tirant lo Blanch;
in Figure 10.1 we see that the sentence length data indicate a much earlier change
of author than the word length dataset (see also Section 7). The combined confi-
dence curve is quite similar to the one from the word length data; the log-likelihoods
and deviances thus appear to judge this source more informative than the sentence
length data.
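The combination step itself is short; the sketch below sums two profile log-likelihood functions over a common grid of τ values and forms the combined deviance. The quadratic toy profiles are hypothetical stand-ins for two information sources, one sharply peaked and one flat.

```python
def combine_profiles(lprof1, lprof2, taus):
    """Sum two profile log-likelihoods from independent sources; return the
    combined ML estimate of tau and the combined deviance
    D_comb(tau) = 2 {l_comb(tau_hat) - l_comb(tau)}."""
    lcomb = {t: lprof1[t] + lprof2[t] for t in taus}
    tau_hat = max(lcomb, key=lcomb.get)
    dev = {t: 2.0 * (lcomb[tau_hat] - lcomb[t]) for t in taus}
    return tau_hat, dev

# hypothetical profiles: a sharp source peaked at 200, a flat one at 300
taus = range(150, 351)
lprof1 = {t: -0.10 * (t - 200)**2 for t in taus}
lprof2 = {t: -0.01 * (t - 300)**2 for t in taus}
tau_hat, dev = combine_profiles(lprof1, lprof2, taus)
```

As in the situation just described, the combined estimate is pulled towards the sharper (more informative) source rather than landing midway between the two.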
10.3. More than one change-point. The focus of our paper has been that of
inference for a single change-point in a sequence of observations, under the operating
assumption that precisely one such change-point exists. Sometimes there are strong
a priori reasons for this, as with our application story of Section 7. In other cases
it is useful to precede a change-point analysis with a test for full homogeneity;
only when the data sequence fails such a test is it meaningful to go hunting for
change-points. In various applications there may also be more than one breakpoint
present. Some of our methods may be extended to cover such cases too, calling also
for additional tools, such as model selection mechanisms to decide on the ‘right’
number of parameter discontinuities.
Our methods can be extended to the case of multiple change-points, but both become more complicated. In some cases we may perform a test, or have some a priori reasons to expect a specific number of change-points, for example two, say τ_1 and τ_2 (and assuming τ_1 < τ_2). Our method A then corresponds to identifying
Figure 10.1. Confidence curves for the change-of-author point τ: the dot-dashed line is the curve based on the mean sentence length in each chapter, the dashed line is the curve based on the mean word length in each chapter, and the full line is the combined confidence curve.
confidence regions, at level α, in the following way (see the corresponding formula (2.1)):

    R(α) = {(τ_1, τ_2) : H_{1,τ_1} accepted at level α^{1/3}, H_{τ_1+1,τ_2} accepted at level α^{1/3},
            H_{τ_2+1,n} accepted at level α^{1/3}}
         = {(τ_1, τ_2) : Z_{1,τ_1} ≤ G_{1,τ_1}^{−1}(α^{1/3}), Z_{τ_1+1,τ_2} ≤ G_{τ_1+1,τ_2}^{−1}(α^{1/3}),
            Z_{τ_2+1,n} ≤ G_{τ_2+1,n}^{−1}(α^{1/3})}.
This will produce joint confidence regions for τ_1 and τ_2. For method B, however, it is more natural to consider confidence curves for each of the change-points separately. With two change-points the log-likelihood takes the form

    ℓ(τ_1, τ_2, θ_L, θ_M, θ_R) = Σ_{i=1}^{τ_1} log f(y_i, θ_L) + Σ_{i=τ_1+1}^{τ_2} log f(y_i, θ_M) + Σ_{i=τ_2+1}^{n} log f(y_i, θ_R),
where θ_M is the model parameter between the two change-points. In order to construct a confidence curve for one of the two change-points, say τ_1, we need (as before) the profile log-likelihood function,

    ℓ_prof(τ_1) = max_{τ_2, θ_L, θ_M, θ_R} ℓ(τ_1, τ_2, θ_L, θ_M, θ_R),   (10.2)
requiring ℓ(τ_1, τ_2, θ_L, θ_M, θ_R) to be maximised over θ_L, θ_M and θ_R as before, but also over all possible values of τ_2. The confidence curve is constructed in a similar way as before, but in this case the distribution of the deviance will depend on τ_2, and the success of our simulation recipe will then depend on how well we estimate τ_2
from the data. Neither of these two suggestions has been tried out in detail. Further
work in these directions could possibly follow ideas from Schweder (1976), Yao &
Au (1989), Bai & Perron (1998) and Braun et al. (2000). However, these articles do
not treat change-points very generally. Yao & Au (1989) and Braun et al. (2000)
consider segmentation problems, while Schweder (1976) and Bai & Perron (1998)
work in a regression setting.
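For small n, the profiling step (10.2) can be implemented by direct enumeration. The sketch below treats the toy case of N(μ, 1) segments, where maximisation over θ_L, θ_M and θ_R reduces to plugging in segment means; the minimum-segment-length guard is our own addition, protecting against empty segments.

```python
import numpy as np

def seg_ll(y):
    """Maximised log-likelihood (up to constants) of one N(mu, 1) segment,
    with the segment mean profiled out."""
    return -0.5 * np.sum((y - y.mean())**2)

def lprof_tau1(y, tau1, min_seg=3):
    """Profile log-likelihood (10.2) for the first change-point tau1:
    the theta's are profiled out in closed form, and we maximise
    explicitly over all admissible second change-points tau2 > tau1."""
    n = len(y)
    left = seg_ll(y[:tau1])
    best = -np.inf
    for tau2 in range(tau1 + min_seg, n - min_seg + 1):
        best = max(best, left + seg_ll(y[tau1:tau2]) + seg_ll(y[tau2:]))
    return best

# noise-free toy series with breaks after observations 20 and 40
y = np.concatenate([np.zeros(20), np.full(20, 3.0), np.zeros(20)])
tau1_hat = max(range(3, 55), key=lambda t: lprof_tau1(y, t))
```

From here the deviance for τ_1 and its simulated null distribution are formed exactly as in the single change-point case.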
References
Bai, J. & Perron, P. (1998). Estimating and testing linear models with multiple
structural changes. Econometrica 66, 47–78.
Billingsley, P. (1968). Convergence of Probability Measures. New York: Wiley.
Boitsov, V. D., Karsakov, A. L. & Trofimov, A. G. (2012). Atlantic water
temperature and climate in the Barents Sea, 2000–2009. ICES Journal of Marine
Science 69, 833–840.
Brander, K. M. (2010). Impacts of climate change on fisheries. Journal of Marine
Systems 79, 389–402.
Braun, J. V., Braun, R. & Müller, H.-G. (2000). Multiple changepoint fitting via quasilikelihood, with application to DNA sequence segmentation. Biometrika 87, 301–314.
Carlin, B., Gelfand, E. & Smith, A. F. M. (1992). Hierarchical Bayesian
analysis of changepoint problems. Applied Statistics 41, 389–406.
Carlstein, E., Müller, H. & Siegmund, D. (1994). Change-Point Problems. New York: Institute of Mathematical Statistics.
Chen, H. & Zhang, N. (2015). Graph-based change-point detection. Annals of
Statistics 43, 139–176.
Claeskens, G. & Hjort, N. L. (2008). Model Selection and Model Averaging.
Cambridge: Cambridge University Press.
Cobb, G. W. (1978). The problem of the Nile: Conditional solution to a change-
point problem. Biometrika 65, 243–251.
Cox, D. R. & Spjøtvoll, E. (1982). On partitioning means into groups. Scan-
dinavian Journal of Statistics 9, 147–152.
Csörgő, S. & Faraway, J. (1996). The exact and asymptotic distributions of Cramér–von Mises statistics. Journal of the Royal Statistical Society Series B 58, 221–234.
Cunen, C. & Hjort, N. L. (2015). Optimal inference via confidence distribu-
tions for two-by-two tables modelled as Poisson pairs: Fixed and random effects.
In Proceedings of the 60th World Statistics Congress, F. Samaniego, ed. Rio de
Janeiro: International Statistical Institute, pp. 3581–3586.
Cunen, C. & Hjort, N. L. (2016). Combining information across diverse sources:
The ii-cc-ff paradigm. In JSM Proceedings. Alexandria, VA: American Statistical
Association.
Efron, B. (2015). Frequentist accuracy of Bayes estimators. Journal of the Royal
Statistical Society Series B 77, 617–646.
Fearnhead, P. (2006). Exact and efficient Bayesian inference for multiple change-
point problems. Statistics and Computing 16, 203–213.
Fraser, D. A. S. (2011). Is Bayes posterior just quick and dirty confidence? [with
discussion and a rejoinder]. Statistical Science 26, 249–316.
Frick, S., Munk, A. & Sieling, H. (2014). Multiscale change-point inference
[with discussion contributions]. Journal of the Royal Statistical Society Series B
76, 495–580.
Frigessi, A. & Hjort, N. L. (2002). Statistical models and methods for discon-
tinuous phenomena. Journal of Nonparametric Statistics 14, 1–6.
Fukuyama, F. (1992). The End of History and the Last Man. Simon and Schuster.
Girón, J., Ginebra, J. & Riba, A. (2005). Bayesian analysis of a multinomial sequence and homogeneity of literary style. American Statistician 59, 19–30.
Gladwell, M. (2000). The Tipping Point: How Little Things Can Make a Big
Difference. New York: Little Brown.
Gould, S. J. & Eldredge, N. (1977). Punctuated equilibria: the tempo and
mode of evolution reconsidered. Paleobiology 3, 115–151.
Hermansen, G. H., Hjort, N. L. & Kjesbu, O. S. (2016). Recent advances
in statistical methodology applied to the Hjort liver index time series (1859-2012)
and associated influential factors. Canadian Journal of Fisheries and Aquatic
Sciences 73, 279–295.
Hinkley, D. V. (1970). Inference about the change-point in a sequence of random
variables. Biometrika 57, 1–17.
Hinkley, D. V. & Hinkley, E. A. (1970). Inference about the change-point in
a sequence of binomial random variables. Biometrika 57, 477–488.
Hjort, J. (1914). Fluctuations in the Great Fisheries of Northern Europe Viewed in the Light of Biological Research. Copenhagen: Rapports et Procès-Verbaux des Réunions du Conseil International pour l'Exploration de la Mer.
Hjort, N. L. (2007). And Quiet Does Not Flow the Don: Statistical analysis of
a quarrel between Nobel laureates. In Consilience, W. Østreng, ed. Oslo: Centre
for Advanced Research, pp. 134–140.
Hjort, N. L., Holmes, C., Müller, P. & Walker, S. G. (2010). Bayesian Nonparametrics. Cambridge University Press.
Hjort, N. L. & Koning, A. (2002). Tests for constancy of model parameters over
time. Journal of Nonparametric Statistics 14, 113–132.
Jarrett, R. G. (1979). A note on the intervals between coal-mining disasters.
Biometrika 66, 191–193.
King, J. R. (2005). Report of the study group on fisheries and ecosystem responses
to recent regime shifts. Tech. Rep. 28, North Pacific Marine Science Organization.
Kjesbu, O. S., Opdal, A. F., Korsbrekke, K., Devine, J. A. &
Skjæraasen, J. E. (2014). Making use of Johan Hjort’s ‘unknown’ legacy:
reconstruction of a 150-year coastal time series on Northeast Arctic cod (Gadus
Morhua) liver data reveals long-term trends in energy allocation patterns. ICES
Journal of Marine Science 71, 2053–2063.
Koziol, J. A. (2014). A note on change-point estimation in a multinomial sequence.
Enliven: Biostatistics and Metrics 1, 1–4.
Liu, D., Liu, R. & Xie, M. (2014). Exact meta-analysis approach for discrete
data and its application to 2 × 2 tables with rare events. Journal of the American
Statistical Association 109, 1450–1465.
Mardia, K. V., Kent, J. T. & Bibby, J. M. (1979). Multivariate Analysis. New
York: Academic Press.
Riba, A. & Ginebra, J. (2005). Change-point estimation in a multinomial se-
quence and homogeneity of literary style. Journal of Applied Statistics 32, 61–74.
Rosenthal, D. H. (1984). Tirant lo Blanc: Foreword to the new translation.
Schweder, T. (1976). Some“optimal”methods to detect structural shift or outliers
in regression. Journal of the American Statistical Association 71, 491–501.
Schweder, T. & Hjort, N. L. (2016). Confidence, Likelihood, Probability. Cam-
bridge: Cambridge University Press.
Siegmund, D. (1988). Confidence sets in change-point problems. International
Statistical Review/Revue Internationale de Statistique 56, 31–48.
Spengler, O. (1918). Der Untergang des Abendlandes. Wien: Braumüller.
Vasilakopoulos, P. & Marshall, C. T. (2015). Resilience and tipping points
of an exploited fish population over six decades. Global Change Biology 21, 1834–
1847.
Worsley, K. J. (1986). Confidence regions and tests for a change-point in a
sequence of exponential family random variables. Biometrika 73, 91–104.
Xie, M.-g. & Singh, K. (2013). Confidence distribution, the frequentist distri-
bution estimator of a parameter: a review. International Statistical Review 81,
3–39.
Yao, Y.-C. & Au, S. (1989). Least-squares estimation of a step function. Sankhyā: The Indian Journal of Statistics, Series A, 370–381.